Kia ora! This report documents my analysis of the New Zealand housing market. My goal was to integrate various datasets related to property prices, rentals, and dwelling tenure to uncover key trends and build predictive models, primarily focusing on the Rent Consumer Price Index (Rent CPI). This work involved extensive data wrangling, exploratory data analysis, feature engineering, machine learning, and initial steps into time series forecasting. I aimed for a robust pipeline that could eventually support predictions out to 2030.
My main objectives for this project were:
* To source, clean, and integrate disparate housing datasets into a cohesive analytical base.
* To perform a comprehensive Exploratory Data Analysis (EDA) to understand historical trends, seasonality, and regional variations in key housing metrics.
* To engineer relevant features that could capture underlying market dynamics and improve predictive model performance.
* To build, tune, and evaluate several machine learning models to predict Rent CPI.
* To explore and implement initial time series forecasting techniques for future Rent CPI prediction, comparing different methodologies.
The analysis is based on three main CSV files I obtained:
* rental.csv: This file contains rental CPI data, which includes breakdowns by broader New Zealand regions.
* property.csv: This dataset provides median property prices for a more granular set of locations across the country.
* dwellings.csv: This file offers national-level statistics on dwelling tenure, such as the number of owner-occupied, rented, and freely provided dwellings.
First things first, I need to load all the R libraries that I’ll be
using throughout this analysis. I’ve grouped them by their primary
function (data manipulation, time series, machine learning, plotting).
I’ve made a note to load plyr before dplyr if
I were to use specific plyr functions, as
dplyr (loaded via tidyverse) has some
functions with the same names. For this project, dplyr is
my main tool for data wrangling. tidymodels is a fantastic
collection of packages for a modern approach to modelling, and
caret is also very useful for its train
function and model evaluation tools.
# Data Manipulation & Core
library(plyr) # load it BEFORE dplyr to avoid conflicts.
library(tidyverse) # Includes dplyr, ggplot2, readr, etc.
library(lubridate) # For date-time manipulation.
library(janitor) # For cleaning column names.
library(zoo) # For as.yearmon, rollapply, rollmean.
# Time Series Specific
library(tseries) # For time series tests like adf.test
library(forecast) # For ARIMA, auto.arima, forecast, checkresiduals, ndiffs
# Machine Learning & Evaluation
library(tidymodels) # Meta-package for modern modelling: recipes, rsample, parsnip, yardstick, etc.
# library(recipes) # For ML preprocessing (loaded by tidymodels)
# library(rsample) # For initial_split (loaded by tidymodels)
library(caret) # For ML model training (can also use tidymodels workflows)
library(glmnet) # For Lasso/Elastic Net regression.
library(randomForest) # For Random Forest algorithm.
library(xgboost) # For XGBoost algorithm.
# library(Metrics) # For RMSE - prefer yardstick::rmse_vec from tidymodels, loaded via 'tidymodels'
# Plotting & Visualization
library(corrplot) # For correlation matrix plots.
library(ggthemes) # For additional ggplot themes.
library(scales) # For formatting axes (dollar, percent, comma).
library(RColorBrewer) # For colour palettes.
library(ggpubr) # For easily adding stats like correlation to ggplot (e.g., stat_cor).
library(ggrepel) # For geom_text_repel to avoid overlapping text labels in plots.

With the environment set up, I can now proceed to load and process the data.
The first step in any analysis is getting the data in and ensuring
it’s in a clean, usable format. The column names in the raw files needed
a bit of tidying (using janitor::clean_names()), and the
time_ref column, which represents the time period, needed
to be converted into a consistent R Date object for proper
time series analysis and merging.
# Reading in my datasets.
# I'm using read_csv from readr (part of tidyverse) for fast and friendly CSV reading,
# and piping straight into clean_names() from janitor to sort out any messy column names.
rental_raw <- read_csv("rental.csv") %>% clean_names()
property_raw <- read_csv("property.csv") %>% clean_names()
dwellings_raw <- read_csv("dwellings.csv") %>% clean_names()
# It's always a good idea to have a quick look at the first few rows and structure
# of the raw data to understand what I'm dealing with.
# For this report, I'll just show a sample of one.
print("Sample of raw rental data:")
## [1] "Sample of raw rental data:"
My initial check of the data showed that the time_ref
column was key for linking datasets but wasn’t in a standard date
format.
The time_ref column in the datasets was numeric (e.g.,
2007.01 for January 2007). Occasionally, it appeared as
2007.1, which I interpreted as October 2007 (effectively
2007.10). To ensure consistent parsing, I wrote a function
to convert these values into proper R Date objects set to
the first day of the respective month. This step is
essential for reliable time-based operations and accurate merging of
datasets.
# This function converts YYYY.MM-style 'time_ref' values into actual R Date objects.
# It uses sprintf to force two decimal places, so '2007.1' becomes '2007.10'.
convert_time_ref_to_date <- function(df, time_ref_col_name = "time_ref") {
df <- df %>%
mutate(
# Standardise to ensure two decimal places, e.g., 2007.1 becomes 2007.10
time_ref_numeric = as.numeric(sprintf("%.2f", .data[[time_ref_col_name]])),
# Extract the year
year = floor(time_ref_numeric),
# Extract the month by multiplying the decimal part by 100
month_val = round((time_ref_numeric %% 1) * 100),
# Create a proper Date object
date = make_date(year, month_val, 1)
) %>%
select(-time_ref_numeric) # Drop intermediate column
return(df)
}
# Apply the function to all raw datasets
rental <- convert_time_ref_to_date(rental_raw, "time_ref")
property <- convert_time_ref_to_date(property_raw, "time_ref")
dwellings <- convert_time_ref_to_date(dwellings_raw, "time_ref")
# Quick validation
print("Sample of Rental Data with new 'date' Column:")
## [1] "Sample of Rental Data with new 'date' Column:"
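To sanity-check the parsing rule in isolation (outside the conversion function), here is a quick base-R check of the sprintf trick on the ambiguous 2007.1 case:

```r
# sprintf("%.2f", ...) forces two decimal places, so 2007.1 and 2007.10
# both become the string "2007.10", i.e. October 2007 under my convention.
x <- as.numeric(sprintf("%.2f", 2007.1))
year_part  <- floor(x)              # year: 2007
month_part <- round((x %% 1) * 100) # month: 10 (round() guards against floating-point drift)
print(c(year_part, month_part))
```

The round() call matters: `2007.1 %% 1` is not exactly 0.10 in floating point, so multiplying by 100 gives 9.999..., which round() snaps back to 10.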
This standardisation step is vital. With consistent date
columns across datasets, I can now reliably join and compare
observations over time. This paves the way for the next step:
integrating the different data sources.
Now that my individual datasets were loaded and had standardised date
columns, the next big step was to integrate them. This was probably one
of the trickiest parts because the property.csv data had
very specific locations, while rental.csv used broader
regional categories (like “Rest of North Island”). To combine these
meaningfully for a regional analysis, I had to:
1. Define which specific locations from property.csv would
roll up into the broader rental regions.
2. Join the dwellings.csv data, which is national and
quarterly, to my monthly regional data. I decided to
do this by mapping each month to its respective quarter and then filling
the national dwelling figures forward across all months within that
quarter.

Here's how I implemented that:
# First, I defined which specific property locations would make up my "Rest of..." rental regions.
# These are based on standard New Zealand regional classifications.
north_island_rest_locations <- c("Northland", "Waikato", "Bay of Plenty", "Gisborne", "Hawke's Bay", "Taranaki", "Manawatu")
south_island_rest_locations <- c("Tasman", "Nelson", "Marlborough", "West Coast", "Otago", "Southland")
# 1. Aggregating property data for "Rest of North Island"
# I'm grouping by date and other original time columns to ensure the mean is calculated correctly for each period.
property_roni <- property %>%
filter(location %in% north_island_rest_locations) %>%
group_by(date, year, month_val, time_ref) %>%
summarise(median_price = mean(median_price, na.rm = TRUE), .groups = "drop") %>% # Using mean of median_prices
mutate(region = "Rest of North Island")
# 2. Aggregating property data for "Rest of South Island"
property_rosi <- property %>%
filter(location %in% south_island_rest_locations) %>%
group_by(date, year, month_val, time_ref) %>%
summarise(median_price = mean(median_price, na.rm = TRUE), .groups = "drop") %>%
mutate(region = "Rest of South Island")
# 3. Preparing property data for regions that have a direct name match, and for the "National" aggregate.
property_direct_match <- property %>%
filter(location %in% c("Auckland", "Wellington", "Canterbury", "NZ total")) %>%
mutate(region = case_when(
location == "NZ total" ~ "National", # Mapping "NZ total" from property data to "National"
TRUE ~ location # Auckland, Wellington, Canterbury names are kept as is.
)) %>%
select(date, year, month_val, time_ref, region, median_price) # Selecting only the columns I need.
# Now, I'm combining these processed property datasets into one table called 'property_mapped'.
property_mapped <- bind_rows(
property_direct_match,
property_roni,
property_rosi
) %>%
select(date, region, median_price, year, month_val, time_ref) %>% # Standardising columns before the join.
distinct(date, region, .keep_all = TRUE) # Just a safeguard to ensure no duplicate rows per date/region.
# Quick look at what property_mapped contains for a specific region.
print("Sample of mapped property data for Auckland:")
## [1] "Sample of mapped property data for Auckland:"
## # A tibble: 6 × 6
## date region median_price year month_val time_ref
## <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1992-01-01 Auckland 135000 1992 1 1992.
## 2 1992-02-01 Auckland 142000 1992 2 1992.
## 3 1992-03-01 Auckland 135000 1992 3 1992.
## 4 1992-04-01 Auckland 135000 1992 4 1992.
## 5 1992-05-01 Auckland 135000 1992 5 1992.
## 6 1992-06-01 Auckland 137000 1992 6 1992.
## [1] "Sample of mapped property data for Rest of North Island:"
## # A tibble: 6 × 6
## date region median_price year month_val time_ref
## <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1992-01-01 Rest of North Island 90000 1992 1 1992.
## 2 1992-02-01 Rest of North Island 91286. 1992 2 1992.
## 3 1992-03-01 Rest of North Island 90571. 1992 3 1992.
## 4 1992-04-01 Rest of North Island 91000 1992 4 1992.
## 5 1992-05-01 Rest of North Island 91143. 1992 5 1992.
## 6 1992-06-01 Rest of North Island 88786. 1992 6 1992.
With the property data now mapped to the same regional definitions as my rental data, I could proceed with merging them. Then, I integrated the national dwellings data.
# Joining my rental data with the 'property_mapped' data.
# Using a left_join to keep all rental records and bring in matching property prices.
merged_rp <- rental %>%
left_join(property_mapped, by = c("date", "region"), suffix = c("_rent", "_prop")) %>%
# The join might create suffixed columns if original names (like year, month_val) existed in both.
# I am cleaning these up, prioritising the ones from the 'rental' dataframe (which I aliased with _rent).
select(-any_of(c("year_prop", "month_val_prop", "time_ref_prop", "time_ref_rent.y", "year.y", "month_val.y"))) %>%
rename(year = year_rent, month_val = month_val_rent, time_ref = time_ref_rent)
# Next, I prepared the quarterly dwellings data for joining with my monthly 'merged_rp' data.
merged_rp <- merged_rp %>%
mutate(
quarter_val = quarter(date), # Figuring out the quarter for each month.
# Creating a join key: the first month of whatever quarter each date falls into.
dwelling_join_month = case_when(
quarter_val == 1 ~ 1, quarter_val == 2 ~ 4,
quarter_val == 3 ~ 7, quarter_val == 4 ~ 10
),
dwelling_date_join = make_date(year, dwelling_join_month, 1)
)
# Selecting and renaming columns from dwellings data for a clean join.
dwellings_for_join <- dwellings %>%
select(date_dwelling_source = date, owner_occupied, provided_free, rented, total)
# Performing the join to get the final 'full_data' table.
# Dwelling data is quarterly, so I'm using fill() to propagate these national figures
# to all months within that quarter for each region.
full_data <- merged_rp %>%
left_join(dwellings_for_join, by = c("dwelling_date_join" = "date_dwelling_source")) %>%
# Grouping by year and quarter ensures fill operates correctly within each period.
group_by(year, quarter_val) %>%
fill(owner_occupied, provided_free, rented, total, .direction = "downup") %>%
ungroup() %>% # Ungrouping is important after grouped operations.
filter(!is.na(total)) # Ensuring only rows with successfully joined dwelling data are kept.
# Let's check the structure and a sample of the final merged data.
print("Structure of full_data:")
## [1] "Structure of full_data:"
## tibble [984 × 14] (S3: tbl_df/tbl/data.frame)
## $ time_ref : num [1:984] 2006 2006 2007 2007 2007 ...
## $ rent_cpi : num [1:984] 1000 995 1007 1012 1020 ...
## $ region : chr [1:984] "Auckland" "Auckland" "Auckland" "Auckland" ...
## $ year : num [1:984] 2006 2006 2007 2007 2007 ...
## $ month_val : num [1:984] 11 12 1 2 3 4 5 6 7 8 ...
## [list output truncated]
## [1] "Sample of final full_data:"
## # A tibble: 6 × 6
## date region rent_cpi median_price owner_occupied total
## <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2006-11-01 Auckland 1000 425000 1098600 1649200
## 2 2006-12-01 Auckland 995 425000 1098600 1649200
## 3 2007-01-01 Auckland 1007 419500 1101500 1655400
## 4 2007-02-01 Auckland 1012 435000 1101500 1655400
## 5 2007-03-01 Auckland 1020 441000 1101500 1655400
## 6 2007-04-01 Auckland 1023 452000 1104000 1661100
This merging process was a bit fiddly, especially getting the
regional mapping and the quarterly-to-monthly join right, but I think
full_data now correctly combines all the information I need
for the next steps of feature engineering and analysis.
With the full_data table now properly integrated, my
next step was to engineer some new features. My thinking here was that
creating more specific indicators from the existing data could help
uncover deeper insights during EDA and provide more meaningful inputs
for the predictive models. The features I decided to create include:
* National dwelling tenure ratios (the rented, owner-occupied, and
provided-free shares of total dwellings).
* A regional price-to-rent ratio (median_price / rent_cpi).
* Log transformations of median_price and rent_cpi, which can help
stabilise variance and linearise relationships, which is often
beneficial for modelling.
* Lagged values and monthly growth rates for prices and rents.
* 3-month rolling averages to smooth out short-term fluctuations.

I computed the lagged and rolling features grouped by region and
arranged by date to ensure the lags and growth rates were correctly
computed for each region's specific time series.

After creating these new features, especially the lagged and rolling
ones, some initial rows for each region would have NA
values (e.g., you can't have a 3-month rolling average for the first two
months of data). I filtered these out to ensure the dataset was clean
for the subsequent analysis.
# Now I'm creating those new features I talked about.
full_data <- full_data %>%
group_by(region) %>% # Grouping by region is crucial for lag and rolling calculations
arrange(date) %>% # Making sure data is sorted by date within each region first
mutate(
# National dwelling ratios (these are from national data, so they'll be the same for all regions in a given quarter)
rented_ratio_national = rented / total,
owner_ratio_national = owner_occupied / total,
free_ratio_national = provided_free / total,
# Regional Price-to-Rent CPI ratio. Checking for zeros or NAs to avoid errors.
price_rent_cpi_ratio = ifelse(rent_cpi > 0 & !is.na(median_price) & median_price > 0, median_price / rent_cpi, NA),
# Log transformations for potentially skewed variables.
log_median_price = ifelse(!is.na(median_price) & median_price > 0, log(median_price), NA),
log_rent_cpi = ifelse(!is.na(rent_cpi) & rent_cpi > 0, log(rent_cpi), NA),
# Lagged values (previous month's value) for calculating growth rates.
median_price_lag = lag(median_price, 1),
rent_cpi_lag = lag(rent_cpi, 1),
# Monthly growth rates - (current - previous) / previous.
price_growth_monthly = ifelse(!is.na(median_price_lag) & median_price_lag != 0, (median_price - median_price_lag) / median_price_lag, NA),
rent_growth_monthly = ifelse(!is.na(rent_cpi_lag) & rent_cpi_lag != 0, (rent_cpi - rent_cpi_lag) / rent_cpi_lag, NA),
    # 3-month rolling averages to smooth out short-term fluctuations.
    # Using rollapply (rather than rollmean, which silently ignores na.rm)
    # so that na.rm = TRUE is actually passed through to mean().
    # align = "right" means the average uses the current and past two periods.
    rent_cpi_roll_avg3 = rollapply(rent_cpi, 3, mean, na.rm = TRUE, fill = NA, align = "right"),
    median_price_roll_avg3 = rollapply(median_price, 3, mean, na.rm = TRUE, fill = NA, align = "right")
) %>%
ungroup() %>% # Important to ungroup after grouped mutations.
  # Removing initial rows that would have NAs due to lag/rollmean calculations.
  # Shifting the start date forward by two months guarantees that every remaining
  # row has at least two prior records available for the 3-month rolling mean.
  filter(date >= (min(full_data$date[which(!is.na(full_data$owner_occupied))], na.rm = TRUE) %m+% months(2)))
print(paste("Final full_data rows after Feature Engineering and initial NA filter:", nrow(full_data)))
## [1] "Final full_data rows after Feature Engineering and initial NA filter: 972"
# Just having a quick look at some of the new features to make sure they look right.
print("Sample of full_data with some engineered features:")
## [1] "Sample of full_data with some engineered features:"
print(head(select(full_data, date, region, rent_cpi, median_price, price_growth_monthly, rent_growth_monthly, price_rent_cpi_ratio, rent_cpi_lag)))
## # A tibble: 6 × 8
## date region rent_cpi median_price price_growth_monthly rent_growth_monthly price_rent_cpi_ratio rent_cpi_lag
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2007-01-01 Auckland 1007 419500 -0.0129 0.0121 417. 995
## 2 2007-01-01 Wellington 1010 350000 -0.0278 0.0423 347. 969
## 3 2007-01-01 Rest of N… 1003 280571. 0.0345 0.0277 280. 976
## 4 2007-01-01 Canterbury 1023 294800 -0.000678 -0.0173 288. 1041
## 5 2007-01-01 Rest of S… 1052 257333. 0.0114 0.0746 245. 979
## 6 2007-01-01 National 1025 325500 -0.0136 0.0333 318. 992
These engineered features should give me more angles to analyse the data from in the EDA, and hopefully provide some stronger signals for the machine learning models later on.
With my full_data table prepared and features
engineered, I was ready to dive into some Exploratory Data Analysis. My
main goal here was to visually inspect the data, look for trends,
patterns, relationships, and any potential outliers or issues that might
affect my later modelling. I planned to look at time series trends,
distributions of key variables, correlations, and specific comparisons
between metrics like rent, price, and dwelling types.
First, I’ll make sure my global plotting theme is set. I want my plots to have a consistent, professional look with a light grey background for better readability.
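The theme setup chunk itself isn't reproduced above, and later chunks reference a key_regions_focus vector that is defined in it. Here is a minimal sketch of that setup; the exact theme options and the contents of key_regions_focus are my assumptions, inferred from the regions that appear in the plots below:

```r
library(ggplot2)

# Broad regions I focus on throughout the EDA (assumed contents, matching later plots).
key_regions_focus <- c("Auckland", "Wellington", "Canterbury",
                       "Rest of North Island", "Rest of South Island", "National")

# A consistent light-grey theme applied globally to every ggplot from here on.
theme_set(
  theme_minimal(base_size = 11) +
    theme(
      panel.background = element_rect(fill = "grey95", colour = NA),
      plot.title      = element_text(face = "bold", colour = "grey10"),
      legend.position = "bottom"
    )
)
```

Because theme_set() changes the global default, none of the individual plot chunks need to repeat the styling; they inherit it automatically.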
I started my exploratory data analysis by looking at how the Rent Consumer Price Index (Rent CPI) has changed over time. My aim here was to get a general feel for the rental market trends across the different broad regions I’d defined. I wanted to see the overall direction of these trends and quickly identify if any regions behaved noticeably differently from the others or the national average.
# Plot 1: Rent CPI Trend
# Taking the 'full_data' and filtering for the key regions I defined,
# and ensuring there's actual rent_cpi data to plot.
plot1_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(rent_cpi))
if (nrow(plot1_data) > 0) {
plot1 <- ggplot(plot1_data, aes(x = date, y = rent_cpi, color = region, group = region)) +
geom_line(alpha = 0.9, linewidth = 1) + # Linewidth at 1 for clear lines
# Using a nice colour palette from RColorBrewer. 'Region:' for the legend title.
scale_colour_brewer(palette = "Dark2", name = "Region:") +
# Formatting the y-axis to show numbers with commas for better readability.
scale_y_continuous(labels = comma) +
labs(
title = "Monthly Rent CPI Over Time by Broad Region",
x = "Date",
y = "Rent CPI"
) +
# Making sure the legend is in a single row if it's at the bottom (as per my global theme).
guides(color = guide_legend(nrow = 1))
# This plot will use the global theme_set I defined earlier,
# so it should have the light grey background and professional styling.
print(plot1)
} else {
# Just in case there's no data for some reason after filtering.
print("Plot 5.1 (Original Plot 1): No data available after filtering.")
}

Plot 5.1: Monthly Rent CPI Over Time by Broad Region.
Looking at this first plot, it’s quite clear that there’s been a general upward trend in Rent CPI across all the broad regions I am looking at. Auckland consistently shows a higher CPI level compared to other regions throughout the entire period. The “National” trend line, as I would expect, sits somewhere in the middle, acting like an average. It’s also interesting to see how the “Rest of North Island” and “Rest of South Island” track fairly closely to each other, generally sitting below the main centres like Auckland and Wellington. This gives me a good baseline understanding of rental cost movements.
Following a similar approach to the Rent CPI, I then plotted the time series for Median Property Prices across the same broad regions. I was particularly interested to see if property prices showed more volatility or different growth patterns compared to rents. It’s also important to remember that for the “Rest of North Island” and “Rest of South Island” categories, these median prices are an average I calculated from several constituent locations. The “National” trend is based on the “NZ total” from the original property data.
# Plot 2: Median Property Price Trend
# Again, filtering for my key regions and ensuring there's median_price data.
plot2_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(median_price))
if (nrow(plot2_data) > 0) {
plot2 <- ggplot(plot2_data, aes(x = date, y = median_price, color = region, group = region)) +
geom_line(alpha = 0.9, linewidth = 1) + # Consistent line styling
# Using a different colour palette ("Set1") to distinguish from the Rent CPI plot.
scale_colour_brewer(palette = "Set1", name = "Region:") +
# Formatting the y-axis to show prices in thousands of dollars for better readability.
scale_y_continuous(labels = dollar_format(prefix = "$", scale = 1/1000, suffix = "K")) +
labs(
title = "Monthly Median Property Price by Broad Region",
subtitle = "'Rest of...' regions use averaged prices. 'National' uses 'NZ total'.",
x = "Date",
y = "Median Property Price"
) +
guides(color = guide_legend(nrow = 1)) # Ensuring legend is neat.
# This plot also uses the global theme_set established earlier for the grey background and general styling.
print(plot2)
} else {
print("Plot 5.2 (Original Plot 2): No data available after filtering.")
}

Plot 5.2: Monthly Median Property Price by Broad Region.
Just as I suspected, property prices show much more dramatic changes and volatility. Auckland’s median price, in particular, has seen very significant increases, far outpacing other regions and the national average, especially in certain periods. The other regions also trend upwards but at different rates and with different peaks and troughs.
After looking at how Rent CPI and Median Property Prices changed over time, I wanted a quick way to see if there were any straight-line relationships between some of my main numbers. A correlation matrix is pretty handy for this. It shows a grid where each cell tells you how strongly two variables move together – a positive number means they tend to go up or down together, a negative number means one goes up as the other goes down, and a number close to zero means not much of a straight-line link.
I picked some of the original metrics like rent_cpi and
median_price, plus some of the new ones I made, like the
price_rent_cpi_ratio and the monthly growth rates.
# Plot 3: Correlation Matrix
# Note: corrplot is not a ggplot2 based plot, so my global theme_set for grey backgrounds won't automatically apply here.
# However, I have added a 'bg = "grey95"' argument directly to the corrplot function
# to try and keep the look somewhat consistent with the other plots.
numeric_cols_for_corr <- c("rent_cpi", "median_price", "rented_ratio_national", "owner_ratio_national",
"price_rent_cpi_ratio", "price_growth_monthly", "rent_growth_monthly")
correlation_data <- full_data %>%
select(all_of(numeric_cols_for_corr)) %>%
drop_na() # Correlation can only be calculated on complete pairs of data.
if(nrow(correlation_data) > 1 && ncol(correlation_data) > 1) {
cor_matrix <- cor(correlation_data) # Calculating the correlation coefficients
# Now plotting the matrix.
# 'method = "color"' shows correlations with colours and shades.
# 'type = "upper"' just shows the upper triangle to avoid repetition.
# 'order = "hclust"' reorders the variables to group similar ones together.
# 'addCoef.col' puts the actual correlation numbers on the plot.
corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black", # Text label colour
tl.srt = 45, # Text label string rotation
tl.cex = 0.8, # Text label character expansion (size)
number.cex = 0.7, # Correlation coefficient number size
diag = FALSE, # Don't show the diagonal
col = brewer.pal(n=10, name="RdYlBu"), # A nice colour palette
bg = "grey95") # Setting a light grey background for the plot panel
# Adding a title manually as corrplot doesn't integrate with ggplot theming for titles easily.
title("Correlation Matrix of Key Housing Variables (Monthly Data)", line = 3, cex.main=1.1, col.main="grey10")
} else {
print("Plot 5.3 (Original Plot 3): Not enough data for correlation plot after dropping NAs from selected columns.")
}

Plot 5.3: Correlation Matrix of Key Housing Variables.
This matrix is quite useful. Straight away, I can see that
rent_cpi and median_price have a pretty strong
positive correlation, shown by the blue colour and a coefficient of
around 0.7 or 0.8, depending on the exact data slice. This means they
generally tend to trend in the same direction. The
price_rent_cpi_ratio is, as you would expect, strongly
linked to median_price (because it’s part of the
calculation) and has a weaker link with rent_cpi. The
monthly growth rates (price_growth_monthly,
rent_growth_monthly) don’t show super strong correlations
with the absolute levels of prices or rents, which makes sense because
growth can be up and down even if overall levels are high or low. This
plot is useful for spotting if any of my predictors for the later
machine learning models are too highly correlated with each other, which
might cause issues.
After looking at the time series trends, I wanted to understand the
actual distribution of Rent CPI values for each key region. A histogram
overlaid with a density plot, faceted by region, seemed like a good way
to visualise this. I used scales = "free" for the facets
because the Rent CPI ranges and distributions can be quite different
across regions, and I wanted to see the individual distributions
clearly.
# Plot 4 in my R script: Distribution of Rent CPI
# Filtering for the key regions and where rent_cpi is not NA.
plot4_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(rent_cpi))
if(nrow(plot4_data) > 0){
plot4 <- ggplot(plot4_data, aes(x = rent_cpi)) +
# Histogram shows the counts in different bins. Using density on y-axis to overlay density curve.
geom_histogram(aes(y = after_stat(density), fill = region), alpha = 0.8, bins = 25, show.legend = FALSE) +
# Density plot gives a smoother look at the distribution shape.
geom_density(aes(color = region), linewidth = 0.9, show.legend = FALSE) +
# Using different colour palettes for fill and the line for clarity.
scale_fill_brewer(palette = "Pastel1") +
scale_color_brewer(palette = "Set1") +
# Faceting by region, with free scales so each plot adapts to its own data range.
facet_wrap(~region, scales = "free") +
labs(
title = "Distribution of Rent CPI by Broad Region",
x = "Rent CPI",
y = "Density"
) +
theme(strip.text = element_text(size=rel(0.85))) # Adjusting facet label size.
# This plot uses the global theme_set.
print(plot4)
} else {
print("Plot 5.4 (Original Plot 4): No data available after filtering.")
}

Plot 5.4: Distribution of Rent CPI by Broad Region.
The analysis explores how the Rent Consumer Price Index (CPI) is
distributed across various New Zealand regions using histograms and
kernel density estimations. Rent market behaviour differs
significantly by region.
* Auckland: Rent CPI shows multimodality, meaning several peaks are
present, and wide dispersion, indicating a wide range of rent index
values.
* Canterbury: A unimodal pattern with a clear peak; the mode is around
1350–1400 CPI. Lower variance than Auckland, indicating more uniform
rent changes across the region.
* Wellington: Main mode around 1050–1150 CPI, showing many lower rent
index increases; a mode at higher CPI values indicates a minority with
larger increases.
* Rest of North Island: Positively skewed, with a long tail to the
right; the mode lies between 1050–1150 CPI.
* Rest of South Island: Unimodal with central tendency between
1100–1200 CPI. Medium dispersion suggests relatively stable rental
market behaviour.
* National: A unimodal shape with the main mode around 1100–1200 CPI
and medium dispersion.

Similar to what I did for Rent CPI, I wanted to see the distribution
of median property prices across the regions. I was expecting these to
be even more skewed than the Rent CPI, especially for regions like
Auckland, because property prices have seen some really big jumps as
shown in the earlier time series plots. Using
scales = "free" on the facets is essential here due to the
wide variation in price levels between regions.
# Plot 5 in my R script: Distribution of Median Property Prices
plot5_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(median_price))
if(nrow(plot5_data) > 0){
plot5 <- ggplot(plot5_data, aes(x = median_price)) +
geom_histogram(aes(y = after_stat(density), fill = region), alpha = 0.8, bins = 25, show.legend = FALSE) + # Using after_stat()
geom_density(aes(color = region), linewidth = 0.9, show.legend = FALSE) +
scale_fill_brewer(palette = "Pastel2") +
scale_color_brewer(palette = "Set2") +
# Formatting x-axis as thousands of dollars for better readability.
scale_x_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) +
facet_wrap(~region, scales = "free") + # Using 'free' scales as price ranges vary a lot by region
labs(
title = "Distribution of Median Prices by Broad Region",
# Adding ($K) to x-axis label for clarity as per scale_x_continuous
x = "Median Price ($K)",
y = "Density"
) +
theme(strip.text = element_text(size=rel(0.85)))
print(plot5)
} else {
print("Plot 5.5 (Original Plot 5): No data after filtering.")
}

Plot 5.5: Distribution of Median Prices by Broad Region.
Auckland
The Auckland median property price data shows clear
multimodality, with at least two main peaks. One
mode (common price range) is around
$500K-$600K and another, more prominent mode
is near $800K-$900K. The data has a wide
dispersion, meaning prices are very spread out, reflecting
a diverse and segmented property market.
Canterbury
Canterbury’s median property price data appears bimodal
(two main peaks). One mode is around $350K,
and a more significant mode is visible around the
$450K mark. This suggests two common price clusters in the
Canterbury region’s property market.
Wellington
Wellington’s median property
price data appears broadly unimodal, with its main
mode or central tendency around
$400K-$500K. The distribution is quite spread
out, with a tail extending towards higher prices,
indicating some variability but a clear concentration in that
mid-range.
Rest of North Island
This region’s median property price data is strongly
unimodal and positively skewed (the tail of
the data goes to the right, towards higher prices). The
mode is clearly at the lower end of the price scale, around
$275K-$300K. This indicates that most properties in these
areas have lower median prices.
Rest of South Island
Similar to the Rest of North Island, the median property price data
here is also strongly unimodal and
positively skewed. The mode is also
concentrated at the lower end, around $275K-$300K. This
suggests that median property prices in these parts of the South Island
are generally lower.
National
The median property price data for the whole country (National) is
broadly spread, possibly unimodal but with a wide base, or
slightly multimodal. The main concentration of prices seems
to be between $350K and $500K, with a
tail showing some much higher prices. This reflects the
average of all diverse regional markets.
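The multimodality claims above can also be checked numerically rather than purely by eye: estimate a kernel density and count its local maxima. This is a minimal sketch using a simulated price vector (not the actual Auckland data) and base R's `density()`.

```r
# Counting modes in a kernel density estimate. The 'prices' vector below is
# simulated to mimic a bimodal market; the real check would use the
# median_price values for a single region.
set.seed(42)
prices <- c(rnorm(200, mean = 550000, sd = 40000),   # lower cluster
            rnorm(300, mean = 850000, sd = 60000))   # upper cluster

dens <- density(prices)
# A grid point is a local maximum when the slope changes from + to -.
peaks   <- which(diff(sign(diff(dens$y))) == -2) + 1
n_modes <- length(peaks)
n_modes          # expected: 2 for this two-cluster mixture
dens$x[peaks]    # approximate locations of the modes
```

The same idea applied per region would confirm, for example, whether Auckland's apparent second peak survives the smoothing or is an artefact of the histogram binning.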
After looking at their individual distributions and time trends, I wanted to directly compare Rent CPI against Median Property Prices for each region. My idea was to see whether regions with higher property prices also tend to have higher rents, and how tightly the two are linked. I’ve added a linear regression line (the dashed line) to each plot to show the general trend, and also included the R-squared (the squared Pearson correlation) and p-value to gauge the strength and significance of this linear relationship.
# Plot 6 in my R script: Rent CPI vs Median Price, with correlation
plot6_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(median_price) & !is.na(rent_cpi))
if(nrow(plot6_data) > 0){
plot6 <- ggplot(plot6_data, aes(x = median_price, y = rent_cpi)) +
# Points are slightly transparent and coloured by region (though faceting also separates them).
geom_point(aes(color = region), alpha = 0.4, show.legend = FALSE, size=1) +
# Adding a linear model smooth line without confidence interval for a cleaner look.
geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed", linewidth=0.6) +
# Using ggpubr to add R-squared and p-value directly onto the plot facets.
# Using after_stat() for labels as per newer ggplot2/ggpubr versions.
# Customising the label to show R^2 and a more readable p-value.
ggpubr::stat_cor(
aes(label = paste(after_stat(rr.label),
if_else(readr::parse_number(after_stat(p.label)) < 0.001,
"p < 0.001", # Displaying very small p-values nicely
paste("p =", scales::pvalue(readr::parse_number(after_stat(p.label)), accuracy=0.001, add_p = FALSE)) # add_p = FALSE since "p =" is already pasted on
),
sep = "~`,`~")), # Comma separation for R^2 and p
method = "pearson",
label.x.npc = "left", label.y.npc = "top", # Positioning the stats text
r.digits = 2, p.digits = 3, # Number of digits for R-squared and p-value
size = 3.0, # Text size for the stats
colour = "grey10" # Text colour
) +
scale_x_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) +
scale_color_brewer(palette = "Dark2") + # Colour palette for points (if shown without faceting)
facet_wrap(~region, scales = "free") + # 'free' scales are crucial here for both axes
labs(
title = "Rent CPI vs. Median Property Price",
subtitle = "Relationship by Broad Region, with Pearson's R-squared and p-value",
x = "Median Property Price ($K)",
y = "Rent CPI"
) +
theme(strip.text = element_text(size=rel(0.85))) # Adjusting facet label size
print(plot6)
} else {
print("Plot 5.6 (Original Plot 6): No data after filtering.")
}
Plot 5.6: Rent CPI vs. Median Property Price by Broad Region, with Correlation Statistics.
Auckland
In Auckland, there is a strong positive linear relationship between
Median Property Price and Rent CPI, with a
Pearson correlation coefficient (R) of 0.97.
This high R-value indicates that as median property prices increase,
Rent CPI also tends to increase in a very consistent, linear way. The
data points are tightly clustered around the positive sloping trend
line.
Canterbury
Canterbury also shows a strong positive linear relationship, with an
R-value of 0.92. While still a very strong
correlation, it’s slightly less than Auckland’s. This means that higher
median property prices are strongly associated with higher Rent CPI, and
the data points follow the positive trend line closely.
Wellington
Wellington also demonstrates a strong positive linear relationship
between Median Property Price and Rent CPI,
with an R-value of 0.96. This high correlation
suggests that increases in median property prices are closely linked
with increases in Rent CPI, following a clear linear pattern.
Rest of North Island
This region exhibits a strong positive linear relationship, indicated
by an R-value of 0.96. This high correlation
means there’s a consistent tendency for Rent CPI to be higher when
median property prices are higher. The data points align closely with
the upward sloping trend line.
Rest of South Island
Similar to other regions, the Rest of South Island shows a strong
positive linear relationship, with an R-value of
0.94. This indicates a strong association where higher
median property prices correspond with higher Rent CPI. The points are
generally well-aligned with the positive trend.
National
For the entire country (National), the correlation between
Median Property Price and Rent CPI is
extremely strong and positive, with an R-value of
0.98. This suggests a very consistent linear trend
nationwide: as property prices go up, rent CPIs also tend to go up. The
scatter of points is minimal around the trend line.
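As a sanity check on the per-facet statistics, the same Pearson R, R-squared, and p-values can be computed directly with `cor()` and `cor.test()`. This sketch runs on a small simulated tibble; in the report it would be applied to `full_data` grouped by region.

```r
library(dplyr)

# Simulated stand-in for full_data: two regions where rent_cpi tracks
# median_price linearly plus noise.
set.seed(1)
demo <- tibble(
  region       = rep(c("Region A", "Region B"), each = 60),
  median_price = runif(120, min = 3e5, max = 9e5)
) %>%
  mutate(rent_cpi = 900 + median_price / 500 + rnorm(120, sd = 40))

cor_check <- demo %>%
  group_by(region) %>%
  summarise(
    r       = cor(median_price, rent_cpi, method = "pearson"),
    r2      = r^2,                                    # matches stat_cor's rr.label
    p_value = cor.test(median_price, rent_cpi)$p.value,
    .groups = "drop"
  )
cor_check   # both regions should show r close to 1 with p far below 0.001
```

A table like this is a useful companion to the plot, since reading exact values off facet annotations is error-prone.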
Next, I wanted to get a sense of the broader housing situation in New Zealand by looking at the national trends in dwelling tenure. I was curious to see how the proportions of rented, owner-occupied, and freely provided dwellings have changed over the years. This provides important context for the rental market. I used a stacked area chart to show how these different tenure types contribute to the total housing stock over time.
# Plot 7 in my R script: National Dwelling Composition - Stacked Area
# These ratios (rented_ratio_national, etc.) are based on national dwelling data,
# so they are the same for all regions at a given point in time.
# I only need one set of these national ratios per date for plotting.
national_dwelling_ratios_long <- full_data %>%
filter(!is.na(rented_ratio_national) & !is.na(owner_ratio_national) & !is.na(free_ratio_national)) %>%
# distinct() keeps one row per date, effectively giving me the national trend.
distinct(date, .keep_all = TRUE) %>%
# Renaming for prettier legend labels in the plot.
select(date,
`Rented` = rented_ratio_national,
`Owner Occupied` = owner_ratio_national,
`Provided Free` = free_ratio_national) %>%
# Pivoting the data longer to make it suitable for a stacked area chart with ggplot.
pivot_longer(cols = -date, names_to = "dwelling_type", values_to = "ratio")
if(nrow(national_dwelling_ratios_long) > 0){
plot7 <- ggplot(national_dwelling_ratios_long, aes(x = date, y = ratio, fill = dwelling_type)) +
geom_area(alpha = 0.85, position = "stack") + # Stacked area chart to show proportions.
scale_y_continuous(labels = percent_format()) + # Y-axis as percentage.
scale_fill_brewer(palette = "Accent", name = "Dwelling Type:") + # A nice colour palette for fills.
labs(
title = "National Dwelling Composition Over Time",
x = "Date",
y = "Proportion of Total Dwellings"
)
# This plot uses the global theme_set I defined earlier.
print(plot7)
} else {
print("Plot 5.7 (Original Plot 7): No data available after filtering.")
}
Plot 5.7: National Dwelling Composition Over Time.
Owner Occupied Dwellings:
The proportion of owner-occupied dwellings (the green
area) consistently represents the largest share of total dwellings
throughout the observed period. This stratum appears relatively stable
over time, though there might be a very slight, gradual decrease. It
consistently accounts for more than 60% of all dwellings,
indicating that outright ownership or mortgage-based ownership is the
dominant tenure type.
Rented Dwellings:
The proportion of rented dwellings (the orange area) is
the second-largest component of the national housing stock. There
appears to be a slight but noticeable increasing trend in this
proportion over the years shown. Starting from around 30%,
the ratio of rented dwellings seems to have gradually risen, suggesting
a growing rental sector relative to the total housing.
Provided Free Dwellings:
Dwellings Provided Free (the purple area) constitute the
smallest proportion of the three tenure types. This category remains
consistently small and relatively stable across the time series,
representing a low single-digit percentage (likely around
5-7%) of the total dwellings. Its contribution to the
overall housing composition shows minimal fluctuation.
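The three tenure ratios are simply shares of the dwelling total. With hypothetical counts (not the actual Stats NZ figures from dwellings.csv), the calculation looks like:

```r
# Hypothetical national dwelling counts; the real figures come from dwellings.csv.
owner_occupied <- 1100000
rented         <- 530000
provided_free  <- 70000
total          <- owner_occupied + rented + provided_free

ratios <- c(owner_ratio_national  = owner_occupied / total,
            rented_ratio_national = rented / total,
            free_ratio_national   = provided_free / total)
round(ratios, 3)
# owner ~0.647, rented ~0.312, free ~0.041 -- the three always sum to 1,
# which is why the stacked areas fill the full height of the chart.
```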
After looking at the overall trends and distributions of actual price and rent levels, I wanted to dig into their growth rates. For property prices, I decided to use boxplots to see the distribution of monthly price growth for each year, and I did this separately for each broad region using facets. This can show me which years had particularly high or low growth, how much growth varied within each year (the size of the box and whiskers), and if there were many outlier months.
# Plot 8 in my R script: Property Price Growth (Monthly) - Boxplot by Year, Faceted
# I'm filtering for my key regions and making sure price_growth_monthly is a valid number.
plot8_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(price_growth_monthly) & is.finite(price_growth_monthly)) %>%
mutate(year_factor = factor(year)) # Treating 'year' as a factor for the x-axis categories.
if(nrow(plot8_data) > 0){
plot8 <- ggplot(plot8_data, aes(x = year_factor, y = price_growth_monthly, fill = region)) +
# Boxplots are great for summarising distributions.
# I am not showing the legend for 'fill' here because the facets already separate regions.
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3, outlier.size=0.8, linewidth=0.3) +
# A horizontal line at zero helps see positive vs. negative growth months.
geom_hline(yintercept = 0, linetype = "dotted", color = "red4", linewidth=0.6) +
# Formatting the y-axis as percentages.
scale_y_continuous(labels = percent_format(accuracy = 0.1)) + # 0.1% steps suit the small monthly growth values
scale_fill_brewer(palette = "Pastel2") + # A nice soft colour palette.
facet_wrap(~region, scales = "free_y", ncol = 3) + # Free y-scales because growth ranges might differ by region.
labs(
title = "Distribution of Monthly Property Price Growth by Year",
subtitle="Faceted by Broad Region",
x = "Year",
y = "Monthly Price Growth (%)" # Added units to y-axis label
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size=rel(0.8)), # Angling x-axis text for readability.
strip.text = element_text(size=rel(0.85)) # Adjusting facet label size.
)
print(plot8)
} else {
print("Plot 5.8 (Original Plot 8): No data available after filtering.")
}
Plot 5.8: Distribution of Monthly Property Price Growth by Year, Faceted by Broad Region.
Facet panels shown: Auckland, Canterbury, Wellington, Rest of North Island, Rest of South Island, and National.
After looking at the distributions of property price growth, I wanted to see a similar view for Rent CPI growth. Instead of boxplots by year, for this one I chose to plot the monthly Rent CPI growth rate as a time series line for each region. Faceting by region helps to see each trend clearly, and I’ve added a horizontal line at 0% to easily distinguish periods of rent growth from rent decline.
# Plot 9 in my R script: Rent CPI Growth (Monthly) - Line plot, Faceted
# Filtering for my key regions and ensuring rent_growth_monthly is a valid number.
plot9_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(rent_growth_monthly) & is.finite(rent_growth_monthly))
if(nrow(plot9_data) > 0){
plot9 <- ggplot(plot9_data, aes(x = date, y = rent_growth_monthly, color = region, group = region)) +
# Using geom_line to see the trend over time.
# Since I am faceting by region and also colouring by region, the legend isn't strictly necessary here.
geom_line(alpha = 0.8, linewidth=0.9, show.legend = FALSE) +
# A line at y=0 helps to quickly see positive vs. negative growth.
geom_hline(yintercept = 0, linetype = "dashed", color = "grey30", linewidth=0.6) +
# Formatting y-axis as percentages.
scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
# Assigning colours to regions, though they are mostly distinguished by facets here.
scale_color_brewer(palette = "Set2", name="Region:") +
labs(
title = "Monthly Rent CPI Growth Over Time",
subtitle="Faceted by Broad Region",
x = "Date",
y = "Monthly Rent CPI Growth (%)" # Added units
) +
facet_wrap(~region, scales="free_y", ncol=2) + # Free y-scales for each region's growth range.
# Hiding the legend as the facets make it clear which region is which.
theme(legend.position = "none", strip.text = element_text(size=rel(0.85)))
# This plot uses the global theme_set.
print(plot9)
} else {
print("Plot 5.9 (Original Plot 9): No data available after filtering.")
}
Plot 5.9: Monthly Rent CPI Growth Over Time, Faceted by Broad Region.
Auckland:
Monthly Rent CPI growth in Auckland shows consistent volatility, with frequent fluctuations typically ranging between -1.0% and +2.0%. There are no very long, sustained periods of exceptionally high or low growth, but rather a persistent pattern of small monthly increases and occasional small decreases. The mean growth appears to be slightly positive over the entire period.
Canterbury:
Canterbury’s monthly Rent CPI growth also exhibits considerable volatility. The typical range of fluctuations appears to be roughly between -2.5% and +2.5%. There are noticeable spikes, both positive and negative, throughout the time series. For instance, there are some sharper positive growth spikes visible around 2012–2014, and also some distinct negative spikes at other times.
Wellington:
Wellington’s monthly Rent CPI growth is also quite volatile, with fluctuations often ranging between -2.5% and +2.5%, but with notable outlier spikes reaching up to +5.0% (e.g., around 2008) and down to nearly -5.0% (e.g., around 2009 and late 2019/early 2020). The time series shows periods of more clustered positive growth and other periods with more negative dips.
Rest of North Island:
Monthly Rent CPI growth in the Rest of the North Island is also volatile, generally fluctuating between approximately -2.0% and +2.0%. The amplitude of these fluctuations seems relatively consistent over the observed period. There are no obvious long-term shifts in the average growth rate, which hovers around zero, perhaps slightly positive.
Rest of South Island:
This region shows the most dramatic volatility in monthly Rent CPI growth, with a y-axis scale ranging from -5.0% to +10.0%. There are several very large positive spikes (e.g., approaching 10% around 2008 and another significant one around 2011–2012) and also some sharp negative spikes. This indicates a much less stable pattern of monthly rent changes compared to other regions. The variance in growth rates is clearly higher here.
National:
The national monthly Rent CPI growth displays a time series pattern that averages out some of the more extreme regional fluctuations, but still shows clear volatility. Growth rates generally move between -1.0% and +2.0%, with occasional spikes beyond this range (e.g., a positive spike near 3% around 2008 and another around 2016). The overall trend seems to be centered slightly above zero.
To get a different view of how property prices relate to rents, I decided to calculate the average Price-to-Rent CPI ratio for each region, each year. This analysis looks at the average Price-to-Rent CPI ratio for different New Zealand regions from 2007 to 2020. A higher ratio generally means properties are relatively more expensive compared to the cost of renting in that area at that time. I thought a heatmap would be a good way to visualise this, as it can quickly show which regions and years had higher or lower ratios. I excluded the “National” figures here to get a better colour scale for the regional variations.
# Plot 10 in my R script: Price-to-Rent CPI Ratio Heatmap by Region and Year
# First, I need to calculate the average price_rent_cpi_ratio for each region and year.
# I'm filtering out any NA or non-finite values and also excluding the "National" region
# to let the regional differences show up better in the colour scale of the heatmap.
heatmap_data_price_rent_ratio <- full_data %>%
filter(!is.na(price_rent_cpi_ratio) & is.finite(price_rent_cpi_ratio) & region != "National") %>%
group_by(year, region) %>%
summarise(avg_price_rent_cpi_ratio = mean(price_rent_cpi_ratio, na.rm = TRUE), .groups = "drop")
if(nrow(heatmap_data_price_rent_ratio) > 0){
plot10 <- ggplot(heatmap_data_price_rent_ratio,
aes(x = factor(year), # Treating year as a categorical factor for the x-axis
# Reordering regions on the y-axis based on their median ratio, highest first.
y = fct_reorder(region, avg_price_rent_cpi_ratio, .fun = median, .desc = TRUE, na.rm=TRUE),
fill = avg_price_rent_cpi_ratio)) + # Colour intensity based on the ratio
geom_tile(color = "white", linewidth=0.2) + # geom_tile creates the heatmap cells, with white borders.
# Using the 'plasma' viridis colour scale, which is good for continuous data and colourblind-friendly.
# Reversing direction so higher values are "hotter" (depends on palette choice).
scale_fill_viridis_c(option = "plasma", name = "Avg. Price/\nRent CPI Ratio", labels = comma, direction = -1) +
labs(
title = "Heatmap: Average Price-to-Rent CPI Ratio",
subtitle="Excluding 'National'. 'Rest of...' regions use averaged prices.",
x = "Year",
y = "Region"
) +
# Customising the theme a bit for a cleaner heatmap look.
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size=rel(0.9)), # Angling year labels.
legend.position = "right",
panel.grid = element_blank(), # Removing grid lines.
panel.border = element_blank() # Removing panel border.
)
# This plot will inherit the overall plot.background from theme_set.
print(plot10)
} else {
print("Plot 5.10 (Original Plot 10): No data available after filtering for the heatmap.")
}
Plot 5.10: Heatmap of Average Price-to-Rent CPI Ratio by Region and Year.
Auckland:
Auckland consistently displays the highest Price-to-Rent CPI ratios
across all years, shown by the persistent dark purple/blue colors.
There’s a clear upward trend from around 2012 onward, with the ratio
peaking between 2016 and 2020. This suggests that property values in
Auckland have increasingly outpaced rent levels, making it the most
expensive region relative to rental returns.
Wellington:
Wellington generally has the second-highest Price-to-Rent CPI ratios,
though still lower than Auckland. The region mostly shows orange to
reddish-purple tones, indicating moderate to high ratios. The trend
becomes more noticeable from around 2015, with ratios rising steadily
into the later years, reflecting growing property prices relative to
rents.
Canterbury:
Canterbury’s ratios fall in the mid-range compared to other regions,
represented by orange and yellowish-orange shades. The pattern remains
relatively stable over time, with a modest increase between 2013 and
2015, followed by a plateau or slight decline, then a small upward
movement closer to 2020. This indicates moderate property price
increases relative to rents.
Rest of North Island:
This region displays consistently lower ratios, indicated by mostly
yellow and light orange colors. A gradual increase is visible over time,
but the ratios remain significantly lower than those in Auckland or
Wellington. This suggests that in the Rest of the North Island, property
prices have remained more aligned with rent levels.
Rest of South Island:
The Rest of South Island shows the lowest Price-to-Rent CPI ratios
throughout the period, represented by bright yellow tones. There is
little change over time, with ratios remaining stable and low. This
reflects a more balanced or affordable relationship between property
prices and rental values in this region.
Sometimes the raw monthly data can be a bit noisy, making it harder to see the main trend. To help with this, I calculated a 3-month rolling average for the Rent CPI. Plotting this smoothed line alongside the actual CPI values can give a clearer picture of the underlying direction of rents in each region. I’ve faceted this by region again to compare.
# Plot 11 in my R script: Smoothed Rent CPI (3-month Rolling Avg) - Faceted by Broad region
# Filtering for key regions and ensuring both actual CPI and the rolling average are available.
plot11_data <- full_data %>%
filter(region %in% key_regions_focus & !is.na(rent_cpi_roll_avg3) & !is.na(rent_cpi))
if(nrow(plot11_data) > 0){
plot11 <- ggplot(plot11_data, aes(x = date)) +
# Plotting the actual CPI with a lighter colour and some transparency.
geom_line(aes(y = rent_cpi, colour = "Actual CPI"), alpha = 0.5, linewidth=0.7) +
# Overlaying the 3-month rolling average with a more prominent line.
geom_line(aes(y = rent_cpi_roll_avg3, colour = "3-Month Rolling Avg"), linewidth = 1.1) +
facet_wrap(~region, scales = "free_y", ncol=3) + # Free y-scales for each region.
# Setting specific colours for the lines to make them distinct.
scale_colour_manual(values = c("Actual CPI" = "grey60", "3-Month Rolling Avg" = "steelblue3"), name="Metric:") +
labs(
title = "Smoothed Rent CPI (3-Month Rolling Average)",
subtitle="Faceted by Broad Region",
x = "Date",
y = "Rent CPI"
) +
theme(legend.position = "top", strip.text = element_text(size=rel(0.85))) # Legend at the top for clarity.
# This plot uses the global theme_set.
print(plot11)
} else {
print("Plot 5.11 (Original Plot 11): No data available after filtering.")
}
Plot 5.11: Smoothed Rent CPI (3-Month Rolling Average) by Broad Region.
Auckland:
Auckland’s 3-month rolling average Rent CPI shows a smooth and
consistent upward trend, rising from around 1000 in 2007 to over 1500 by
early 2020. The smoothed line minimizes monthly volatility and reveals a
steady growth pattern with no major periods of sharp acceleration or
stagnation.
Canterbury:
Canterbury exhibits a sharp rise in the smoothed Rent CPI from around
2011 to 2015, increasing from about 1100 to over 1400. After this period
of rapid growth, the curve flattens significantly, indicating much
slower growth or stabilization from 2015 to 2020, with the CPI hovering
between 1400–1450.
Wellington:
Wellington’s smoothed Rent CPI follows a clear upward path, starting
near 1000–1100 and rising to above 1500 by 2020. The growth appears to
accelerate slightly from 2015 onward. The rolling average captures this
overall upward shift while reducing the noise in monthly
fluctuations.
Rest of North Island:
This region shows a strong and steady upward trend in the smoothed Rent
CPI, increasing from around 1000 in late 2007 to about 1600 by early
2020. The growth rate appears consistent throughout the time period.
Rest of South Island:
The smoothed Rent CPI here increases from about 1000 to just over 1400,
but the underlying actual CPI line shows much greater volatility. While
the rolling average smooths out some of this noise, it still reflects
more irregularities compared to other regions, with visible phases of
faster and slower growth.
National:
The national 3-month rolling average Rent CPI rises steadily from 1000
to roughly 1500 between 2007 and 2020. The smoothing highlights a clear,
nearly linear upward trend across the entire period, masking
shorter-term fluctuations.
After looking at the growth rates themselves, I was curious about how stable or volatile those growth rates have been. To measure this, I calculated a 12-month rolling standard deviation of the monthly property price growth for each region. A higher value on this plot means more volatility (bigger swings in the growth rate from month to month over the past year), while a lower value suggests more stable or predictable growth. I’ve faceted this by region to compare.
# Plot 12 in my R script: Volatility of Property Price Growth (12-month Rolling SD)
# Calculating the 12-month rolling standard deviation of price_growth_monthly for each region.
# Using zoo::rollapply for this.
plot12_data <- full_data %>%
group_by(region) %>% # Calculation needs to be done per region.
arrange(date) %>% # Ensuring data is in chronological order for rollapply.
mutate(
price_growth_volatility_12m = zoo::rollapply(
data = price_growth_monthly,
width = 12, # 12-month window
FUN = sd, # Standard deviation function
fill = NA, # Fill initial periods (where full window not available) with NA
align = "right", # Align window to the right (current and past 11 months)
na.rm = TRUE # Remove NAs within the window before calculating sd
)
) %>%
ungroup() %>%
# Filtering for key regions and valid volatility data.
filter(region %in% key_regions_focus & !is.na(price_growth_volatility_12m) & is.finite(price_growth_volatility_12m))
if(nrow(plot12_data) > 0){
plot12 <- ggplot(plot12_data, aes(x = date, y = price_growth_volatility_12m, color = region, group=region)) +
geom_line(linewidth = 0.9, show.legend=FALSE) + # Legend off as faceting by region.
# Y-axis as percentage, though it's a standard deviation of percentages.
scale_y_continuous(labels = percent_format(accuracy=0.1)) +
scale_color_brewer(palette = "Set1", name="Region:") +
labs(
title = "Monthly Property Price Growth Volatility",
subtitle="12-Month Rolling Standard Deviation, Faceted by Broad Region",
x = "Date",
y = "Std. Dev of Monthly Price Growth"
) +
facet_wrap(~region, scales="free_y", ncol=2) + # Free y-scales as volatility levels can differ.
theme(strip.text = element_text(size=rel(0.85)))
# This plot uses the global theme_set.
print(plot12)
} else {
print("Plot 5.12 (Original Plot 12): No data available after filtering.")
}
Plot 5.12: Volatility of Monthly Property Price Growth (12-Month Rolling Standard Deviation), Faceted by Broad Region.
Auckland:
Auckland’s property price growth volatility fluctuated over time. It was relatively high (around 3.0%–3.5%) during the 2008–2009 period, dropped to about 2.0%–2.5% between 2010–2012, then peaked again around 2013–2014 (~4.5%). After that, volatility remained moderately high, generally moving between 2.5% and 4.0% through to 2020.
Canterbury:
Canterbury began with moderate volatility (~2.0%–2.5%) and saw a noticeable spike around 2012 (~4.0%). This was followed by a decreasing trend until about 2016 (dipping below 2.0%), before volatility increased again, peaking near 4.0% around 2019. These shifts reflect a cyclical volatility pattern.
Wellington:
Wellington had the highest volatility overall, starting around 4.0%–5.0% in 2008, dropping below 3.0% by 2013–2014, and then rising sharply to over 7.0% in 2016–2017. In recent years, volatility remained high, staying above 5.0%.
Rest of North Island:
This region showed a high volatility level around 2008–2009 (~3.0%), then another peak around 2011–2012 (~3.3%) and a stronger one around 2015 (~4.0%). From 2016 onwards, volatility became more stable and stayed lower, generally within the 1.5%–2.5% range.
Rest of South Island:
The Rest of the South Island experienced large swings. Volatility rose to ~4.0% by 2010, fell sharply to just above 1.0% around 2015, and then climbed again above 3.0% by 2020. This suggests alternating periods of stable and unstable price growth.
National:
National property price growth volatility was lowest around 2008 (~1.5%) and steadily increased to a peak near 5.0% around 2014–2015. Though it declined slightly afterward, volatility stayed relatively elevated (between 2.5% and 4.5%) for the rest of the period.
To get a more current snapshot of the price-to-rent situation, I decided to look at the average Price-to-Rent CPI ratio for each region in the most recent full year of my data. I then filtered for the top 5 regions with the highest ratios, excluding “National” to focus on regional specifics. A higher ratio here could suggest that properties in these regions are relatively more expensive to buy compared to the cost of renting, or perhaps offer different investment dynamics.
# Plot 13 in my R script: Top 5 Regions by Current Average Price-to-Rent CPI Ratio
# First, I find the most recent full year in my data.
current_year_val <- max(full_data$year, na.rm = TRUE)
# Then, I calculate the average price_rent_cpi_ratio for that year, for each region,
# excluding "National" and any NA/non-finite ratios.
# Finally, I pick the top 5 regions with the highest average ratios.
top5_price_rent_regions <- full_data %>%
filter(year == current_year_val &
!is.na(price_rent_cpi_ratio) &
is.finite(price_rent_cpi_ratio) &
region != "National") %>%
group_by(region) %>%
summarise(avg_ratio = mean(price_rent_cpi_ratio, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(avg_ratio)) %>%
slice_head(n = 5) # dplyr::slice_head() is good for this
if(nrow(top5_price_rent_regions) > 0){
plot13 <- ggplot(top5_price_rent_regions, aes(x = reorder(region, avg_ratio), y = avg_ratio, fill = region)) +
geom_col(show.legend = FALSE, alpha=0.85) + # Bar chart, no legend needed as regions are on axis.
# Adding the actual ratio value as text on the bars.
geom_text(aes(label=sprintf("%.1f", avg_ratio)), hjust = -0.2, size=3.5, fontface="bold", color="grey20") +
# Flipping coordinates to make it a horizontal bar chart - often easier to read region names.
coord_flip(ylim=c(0, max(top5_price_rent_regions$avg_ratio, na.rm=T)*1.18)) +
scale_fill_brewer(palette = "Pastel1") + # Using the palette from your script.
labs(
title = paste("Top 5 Regions by Avg. Price-to-Rent CPI Ratio in", current_year_val),
subtitle = "Excludes 'National'. 'Rest of...' categories use averaged prices for their median_price component.",
x = "Region",
y = "Average Price-to-Rent CPI Ratio"
) +
# Customising the theme slightly for a horizontal bar chart.
theme(
panel.grid.major.x = element_line(colour = "grey80"), # Keep horizontal grid lines
panel.grid.major.y = element_blank() # Remove vertical grid lines for coord_flip
)
# This plot uses the global theme_set for overall background.
print(plot13)
} else {
print("Plot 5.13 (Original Plot 13): No data available after filtering for top regions.")
}
Plot 5.13: Top 5 Regions by Average Price-to-Rent CPI Ratio in the Most Recent Year.
Auckland:
Auckland had the highest average Price-to-Rent CPI ratio in 2020,
reaching 596.6. This was well above all other regions,
confirming that property prices in Auckland were significantly higher
relative to rents, consistent with earlier observations.
Wellington:
Wellington followed with a ratio of 435.4, showing
property prices were also high relative to rents but with a large gap
compared to Auckland (over 160 points), placing it clearly in a second
tier.
Canterbury:
With an average ratio of 332.4, Canterbury ranked
third. This indicates a moderate level of price-to-rent imbalance, much
lower than both Auckland and Wellington, suggesting greater
affordability in comparison.
Rest of North Island:
This region had a ratio of 324.3, nearly the same as
Canterbury, pointing to similar price-to-rent dynamics and relatively
lower property price pressure.
Rest of South Island:
The lowest among the top five, with a ratio of 322.9,
the Rest of South Island had property prices closest to rental levels,
indicating the least disparity between the two metrics in 2020 among
these regions.
I was curious to see if there was any visible relationship between the median property prices in specific regions and the broader national trend of owner-occupation rates. My thinking was that as national homeownership rates change (perhaps due to affordability or policy), it might correlate with regional price movements, though I suspected this might be a less direct relationship than some others.
To explore this, I created scatter plots for each region, plotting its median property price against the national owner-occupied dwelling percentage. I added a linear regression line and, importantly, the R-squared and p-value to each facet to help judge if any observed trend was statistically meaningful. I excluded the “National” region itself from this particular faceted plot to focus on the other broad regions.
# Plot 14 in my R script: Property Price vs. National Owner-Occupied Ratio - Faceted (Enhanced)
# Filtering for key regions (excluding "National" for this specific facet view),
# and ensuring the necessary data points are available.
plot14_data <- full_data %>%
filter(region %in% key_regions_focus &
!is.na(owner_ratio_national) &
!is.na(median_price) &
region != "National") # Excluding National as we're comparing regions to the national trend
if(nrow(plot14_data) > 0){
plot14 <- ggplot(plot14_data, aes(x = owner_ratio_national, y = median_price)) +
# Points are coloured by region (though facets also separate them).
# Using a consistent colour might also work well here.
geom_point(aes(color = region), alpha = 0.25, show.legend = FALSE, size=1) +
# Linear smooth line to show the general trend within each facet.
geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed", linewidth=0.6) +
# Adding R-squared and p-value to each facet using ggpubr.
# Using after_stat() for labels which is the current best practice.
ggpubr::stat_cor(method = "pearson",
aes(label = paste(after_stat(rr.label), # R-squared label
# Formatting p-value nicely
if_else(readr::parse_number(after_stat(p.label)) < 0.001,
"p < 0.001",
paste("p =", scales::pvalue(readr::parse_number(after_stat(p.label)), accuracy=0.001, add_p = FALSE))
),
sep = "~`,`~")), # Separator
label.x.npc = "left", label.y.npc = "top", # Position of the text
r.digits = 2, p.digits = 3, # Number of digits
size = 2.8, colour = "grey10") + # Text styling
scale_x_continuous(labels = percent_format()) + # X-axis as percentage
scale_y_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) + # Y-axis as thousands of dollars
scale_color_brewer(palette = "Set2") + # Colour palette for points
facet_wrap(~region, scales = "free", ncol = 3) + # Free scales as relationships might differ
labs(
title = "Regional Property Prices vs. National Owner-Occupied Trends",
subtitle = "Linear relationship shown with R-squared and p-value for each region.",
x = "National Proportion of Owner-Occupied Dwellings",
y = "Median Property Price (Regional, $K)" # Clarified y-axis unit
) +
theme(strip.text = element_text(size = rel(0.85))) # Adjusting facet label size
# This plot uses the global theme_set.
print(plot14)
} else {
print("Plot 5.14 (Original Plot 14): No data available after filtering.")
}
Plot 5.14: Regional Property Prices vs. National Owner-Occupied Dwelling Percentage.
Auckland:
Auckland shows a strong negative linear relationship between national owner-occupied percentage and median property prices, with R² = 0.79 and p < 0.001. This means that as fewer people own homes nationally, Auckland prices tend to rise significantly, and this pattern is statistically robust.
Canterbury:
Canterbury also has a strong and significant relationship (R² = 0.72, p < 0.001), suggesting that around 72% of price variation is tied to changes in national home ownership. Price increases are clearly linked to drops in owner-occupation.
Wellington:
Wellington mirrors Canterbury, with a strong and significant negative correlation (R² = 0.72, p < 0.001). Falling national owner-occupied rates are strongly associated with rising Wellington house prices.
Rest of North Island:
The region displays a moderate relationship with R² = 0.48 (p < 0.001). While the trend is still negative and significant, only about half the price variation is explained by changes in owner-occupation.
Rest of South Island:
Here, the negative relationship is weaker but still statistically significant (R² = 0.23, p < 0.001). Owner-occupation rates have a smaller explanatory role in driving price movements in this region.
For my final EDA plot, I wanted to directly compare the growth rates
of property prices and rents in the most recent full year of data I
have. I calculated the average annualised monthly growth for both
median_price and rent_cpi for each region
(excluding “National” to focus on regional dynamics).
Plotting these against each other helps me see: * Which regions had
high or low growth in both prices and rents. * Whether price growth and
rent growth were similar (points close to the diagonal line where x=y).
* Or if one was significantly outpacing the other (points far from the
diagonal). The size of the points is mapped to the absolute difference
between the two growth rates, highlighting regions where the divergence
is largest. I have used ggrepel to label the points with
region names so it’s easy to see which is which.
# Plot 15 in my R script: Regional Price & Rent CPI Growth Comparison
# First, I determine the most recent 'full' year in my data to ensure I'm comparing complete annual figures.
recent_full_year_val <- if(length(unique(full_data$year[!is.na(full_data$year)])) > 1) {
# If more than one year of data, take the year before the latest one.
max(full_data$year, na.rm=TRUE) -1
} else {
# Otherwise, just use the latest year (might be incomplete but it's all I have).
max(full_data$year, na.rm=TRUE)
}
# Calculating average annualised monthly growth for price and rent for each region in that year.
growth_comparison_data <- full_data %>%
filter(year == recent_full_year_val &
!is.na(price_growth_monthly) & is.finite(price_growth_monthly) & # Ensure valid growth numbers
!is.na(rent_growth_monthly) & is.finite(rent_growth_monthly) &
region != "National") %>% # Excluding "National" to focus on regions.
group_by(region) %>%
# Annualising the average monthly growth rate by multiplying by 12 (a simple, non-compounding approximation).
summarise(avg_annual_price_growth = mean(price_growth_monthly*12, na.rm = TRUE),
avg_annual_rent_growth = mean(rent_growth_monthly*12, na.rm = TRUE),
.groups = "drop") %>%
drop_na() # Removing any regions where calculation might have failed.
if(nrow(growth_comparison_data) > 0){
plot15 <- ggplot(growth_comparison_data,
aes(x = avg_annual_rent_growth, y = avg_annual_price_growth)) +
# Using shape 21 allows for both fill and colour aesthetics.
# Size of point shows the absolute difference between price and rent growth.
geom_point(aes(fill = region, size = abs(avg_annual_price_growth - avg_annual_rent_growth)),
alpha = 0.8, shape=21, stroke=0.6, color="white", show.legend=FALSE) +
# Diagonal line where price growth = rent growth.
geom_abline(slope = 1, intercept = 0, linetype = "twodash", color = "grey40", linewidth=0.8) +
# Using ggrepel to label regions without too much overlap.
ggrepel::geom_text_repel(aes(label = region, colour=region),
size = 3.5, max.overlaps = 20,
segment.alpha = 0.6, fontface="italic",
box.padding = 0.5, point.padding = 0.5,
show.legend = FALSE) + # Hiding legend for text labels too.
scale_x_continuous(labels = percent_format()) + # X-axis as percentage.
scale_y_continuous(labels = percent_format()) + # Y-axis as percentage.
scale_fill_brewer(palette = "Set3", name="Region (Fill):") + # Colour palette for point fills.
scale_colour_brewer(palette="Dark2", name="Region (Label):") + # Different palette for text labels.
scale_size_continuous(name="Growth Difference (Abs):") + # Legend for point size.
labs(
title = paste("Annualised Property Price Growth vs. Rent CPI Growth in", recent_full_year_val),
subtitle="Bubble size indicates absolute difference between price and rent growth. Excludes 'National'.",
x = "Average Annualised Monthly Rent CPI Growth",
y = "Average Annualised Monthly Property Price Growth"
) +
# Adjusting legend position if needed, or it will use the global theme setting.
theme(legend.position = "right", legend.box = "vertical")
# This plot uses the global theme_set.
print(plot15)
} else {
print("Plot 5.15 (Original Plot 15): No data available after filtering for growth comparison.")
}
Plot 5.15: Annualised Property Price Growth vs. Rent CPI Growth in the Most Recent Full Year.
Auckland:
Auckland showed moderate growth: property prices rose
~3–4%, and rents ~2.7%. It’s slightly above the
diagonal, indicating a small price–rent growth difference. The
small bubble highlights its more balanced market
conditions in 2019.
Canterbury:
Canterbury had the lowest growth in both prices
(~1.5–2%) and rents (~1.6%). It is positioned just below the
diagonal, suggesting rent growth slightly outpaced property
prices. The smallest bubble shows minimal divergence
between price and rent growth.
Wellington:
Wellington showed strong property price growth
(~12–13%) and rent growth of about 4.1%. It is above the
diagonal line, indicating a clear outperformance of prices over
rents. The bubble is large, second only to the Rest of
South Island.
Rest of North Island:
This region had the highest rent CPI growth (~5%) along
with high property price growth (~12.5–13.5%). It lies above the
diagonal, meaning prices rose faster than rents. The
notable bubble size reflects the significant gap
between these growth rates.
Rest of South Island:
This region recorded the highest annualised property price
growth in 2019 (~13–14%), with rent CPI growth at around 3.8%.
It sits well above the diagonal line, showing that
property prices surged far more than rents. It also has the
largest bubble, reinforcing this wide gap.
END OF EXPLORATORY DATA ANALYSIS
After exploring the data, my next big goal was to see if I could
build models to predict the Rent Consumer Price Index
(rent_cpi). I decided to try a few different machine
learning algorithms, from a basic Linear Regression to more complex
ensemble methods like Random Forest and XGBoost, to see which would
perform best. The whole process involved careful data preparation
specific to modelling, robust preprocessing using a recipe,
cross-validation for tuning and evaluation, and finally, checking
performance on an unseen test set.
Before I could train any models, I needed to prepare a specific
dataset for this task. This involved: 1. Selecting the outcome variable
(rent_cpi) and the predictor variables (features) I thought
would be most relevant from my full_data table. This
includes the features I engineered earlier. 2. Handling any missing
values (NAs), especially for the target variable and key
predictors, to ensure the models get clean data. 3. Explicitly creating
numerical date-derived features like year,
month, quarter, and half_year
from the date column. My plan was to then convert
month, quarter, and half_year
into factors within the recipe so they could be properly dummy-coded for
the models.
# --- 1. Define Feature Set for Modelling ---
# These are the columns I have chosen from 'full_data' to use as potential predictors.
# I'm also keeping 'date' and 'region' for now, as they'll be set to 'ID' roles in the recipe.
ml_feature_cols <- c(
"median_price", # Median property price in the region.
"rented_ratio_national", # National proportion of dwellings that are rented.
"owner_ratio_national", # National proportion of owner-occupied dwellings.
"free_ratio_national", # National proportion of dwellings provided free.
"price_rent_cpi_ratio", # Ratio of median property price to Rent CPI in the region.
"log_median_price", # Log-transformed median property price.
"price_growth_monthly", # Monthly growth rate of median property price in the region.
"rent_growth_monthly" # Monthly growth rate of Rent CPI in the region.
)
# --- 2. Initial Data Selection ---
# Creating 'ml_data_pre_split' by selecting the outcome, IDs, and the features defined above.
ml_data_pre_split <- full_data %>%
select(rent_cpi, date, region, all_of(ml_feature_cols)) %>%
drop_na(rent_cpi) # Crucial: Rows with a missing target variable can't be used.
# --- 3. Sanity Checks & Further NA Handling for Predictors ---
# This block ensures the data going into the ML pipeline is viable.
if (nrow(ml_data_pre_split) > 0) {
print("--- ML Data Sanity Checks ---")
print(paste("Rows before dropping NAs in predictors:", nrow(ml_data_pre_split)))
# Further dropping rows if any of these specific key predictors have NAs.
# While the recipe will handle some imputation later, this initial drop ensures that
# rows with missing data in critical analytical features are removed.
ml_data_final <- ml_data_pre_split %>%
drop_na(any_of(c("median_price", "price_rent_cpi_ratio",
"rented_ratio_national", "price_growth_monthly", "rent_growth_monthly")))
print(paste("Rows after dropping NAs in key predictors:", nrow(ml_data_final)))
# Checking if the target variable has sufficient variance after all this filtering.
# If variance is very low or zero, modelling might not be meaningful.
if (nrow(ml_data_final) > 0 && var(ml_data_final$rent_cpi, na.rm = TRUE) < 1e-6) {
print("Warning: Target 'rent_cpi' has near-zero variance after filtering. ML might not be meaningful.")
ml_data_final <- data.frame() # Creating an empty dataframe to gracefully skip ML if target variance is an issue.
}
} else {
ml_data_final <- data.frame() # Starting with an empty dataframe if pre_split was already empty.
}
## [1] "--- ML Data Sanity Checks ---"
## [1] "Rows before dropping NAs in predictors: 972"
## [1] "Rows after dropping NAs in key predictors: 972"
# --- 4. Add Date-Derived Features Explicitly ---
# Creating numerical year, month, quarter, and half_year features from the 'date' column.
# My recipe will later convert 'month', 'quarter', and 'half_year' into factors for modelling.
if (nrow(ml_data_final) > 0) {
ml_data_final <- ml_data_final %>%
mutate(
year = lubridate::year(date),
month = lubridate::month(date), # Will be converted to factor in recipe
quarter = lubridate::quarter(date), # Will be converted to factor in recipe
half_year = ifelse(month %in% 1:6, 1, 2) # Will be converted to factor in recipe
)
}
# Quick check of the data that will be used for modelling.
print("Sample of ml_data_final (first few rows, selected columns):")
## [1] "Sample of ml_data_final (first few rows, selected columns):"
if(nrow(ml_data_final) > 0) print(head(select(ml_data_final, date, region, rent_cpi, median_price, year, month)))
## # A tibble: 6 × 6
## date region rent_cpi median_price year month
## <date> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2007-01-01 Auckland 1007 419500 2007 1
## 2 2007-01-01 Wellington 1010 350000 2007 1
## 3 2007-01-01 Rest of North Island 1003 280571. 2007 1
## 4 2007-01-01 Canterbury 1023 294800 2007 1
## 5 2007-01-01 Rest of South Island 1052 257333. 2007 1
## 6 2007-01-01 National 1025 325500 2007 1
This ml_data_final table is what I will use for
splitting into training and testing sets in the next step. Making sure
it’s clean and has the right initial features is pretty important before
diving into the recipe and model training.
With my ml_data_final table ready, the next crucial step
before training any models was to split it into a training set and a
testing set. I decided on a standard 80/20 split, meaning 80% of the
data would be used to train the models, and the remaining 20% would be
held back as unseen data to test how well the models generalise. To make
sure both sets were representative, I used stratified sampling based on
the outcome variable, rent_cpi. This helps ensure that the
distribution of rent_cpi values is similar in both my
training and testing sets. Setting a seed (set.seed(123))
means this split will be the same every time I run the code, which is
great for reproducibility.
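Worth noting: when given a numeric outcome as `strata`, `initial_split()` first bins it (into quartiles by default, controllable via the `breaks` argument) and then samples within each bin. A tiny self-contained check of the 80/20 behaviour on toy data:

```r
library(rsample)

set.seed(123)
toy <- data.frame(y = rnorm(200))

# Stratifying on a numeric column bins it into quantiles before sampling,
# so each quartile of y contributes ~80% of its rows to the training set.
sp <- initial_split(toy, prop = 0.8, strata = y)
c(train = nrow(training(sp)), test = nrow(testing(sp)))
```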
Once I had my training data, I defined a preprocessing
recipe using the recipes package (part of
tidymodels). This is a really powerful way to outline all
the data preparation steps that need to happen. The beauty of a recipe
is that these steps are defined once and then consistently applied,
especially during cross-validation and when processing the test set
later on.
My recipe for this project includes several steps: * Role
Assignment: I first told the recipe that region
and the original date column are just ‘ID’ variables, so
they won’t be used as predictors by default. * Date Feature
Conversion: I used step_mutate to convert the
numeric month, quarter, and
half_year columns (that I created earlier from the
date) into proper factors with meaningful labels (like
“Jan”, “Feb” for months; “Q1”, “Q2” for quarters). This is important so
they can be correctly turned into dummy variables. * Handling
New Factor Levels: step_novel is included to deal
with any new factor levels that might pop up in new data (like in a CV
fold or the test set) but weren’t in the original training data used to
prep the recipe. * Zero Variance
Predictors: step_zv removes any predictors that
have only one unique value (i.e., zero variance), as these offer no
predictive information. * Imputation:
step_impute_median fills in any remaining missing values in
numeric columns using the median calculated from the training set. This
ensures the models don’t break due to NAs. * Dummy
Variables: step_dummy converts all my nominal
(factor) predictors into numeric dummy variables. I used
one_hot = FALSE, which creates k-1 dummies for a k-level
factor. This is generally preferred for models like linear regression
that include an intercept, as it avoids perfect multicollinearity from
the dummy set. * Normalisation:
step_normalize scales all numeric predictors to have a mean
of 0 and a standard deviation of 1. This can help some algorithms
perform better. * Correlation Filter: Finally,
step_corr removes one of a pair of numeric predictors if
their absolute correlation is very high (I set the threshold at 0.9),
which helps reduce multicollinearity.
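To see the `one_hot = FALSE` behaviour in isolation (a toy example, separate from the report's actual recipe): a k-level factor becomes k-1 dummy columns, with the first level acting as the reference.

```r
library(recipes)

# Toy data: a 4-level factor predictor and a numeric outcome.
toy <- data.frame(y   = c(10, 20, 30, 40),
                  qtr = factor(c("Q1", "Q2", "Q3", "Q4")))

rec_toy <- recipe(y ~ qtr, data = toy) %>%
  step_dummy(qtr, one_hot = FALSE) %>%
  prep()

baked <- bake(rec_toy, new_data = NULL)
# The 4-level factor yields 3 dummy columns (qtr_Q2, qtr_Q3, qtr_Q4);
# Q1 is the reference level, encoded as all three dummies being 0.
names(baked)
```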
After defining all these steps, I prepped the recipe
using the training data. This ‘prepping’ step estimates any parameters
needed from the training data (like medians for imputation,
means/standard deviations for normalisation, etc.). This
prepped_recipe_for_models object is what I’ll use to apply
these transformations consistently.
# Proceeding only if I have enough data after the previous preparation steps.
if (nrow(ml_data_final) > 100) {
# --- 5a. Data Splitting ---
# Setting a seed for random number generation to ensure my data split is reproducible every time I run the script.
set.seed(123)
# Splitting the data: 80% for training, 20% for testing.
# Stratifying by 'rent_cpi' to maintain similar outcome distributions in both sets, which is good practice.
split <- initial_split(ml_data_final, prop = 0.8, strata = rent_cpi)
train_data <- training(split)
test_data <- testing(split)
print(paste("Training data rows:", nrow(train_data)))
print(paste("Test data rows:", nrow(test_data)))
# --- 5b. Define Preprocessing Recipe ---
# The recipe outlines a series of steps to prepare data for modelling.
# The formula 'rent_cpi ~ .' means I'm predicting 'rent_cpi' using all other columns
# in 'train_data' as predictors by default, unless their roles are changed.
rec <- recipe(rent_cpi ~ ., data = train_data) %>%
# 'region' and 'date' are assigned 'ID' roles: they won't be used as predictors by default
# but are kept in the data for potential reference or specific time-series handling later.
update_role(region, new_role = "ID") %>%
update_role(date, new_role = "ID") %>% # Original date column is now an ID.
# Converting my manually created numeric month, quarter, and half_year into factors
# with proper labels. This is important before creating dummy variables.
step_mutate(
month = factor(month, levels = 1:12, labels = month.abb, ordered = FALSE), # e.g., 1 -> "Jan"
quarter = factor(quarter, levels = 1:4, labels = paste0("Q", 1:4), ordered = FALSE), # e.g., 1 -> "Q1"
half_year = factor(half_year, levels = 1:2, labels = paste0("H", 1:2), ordered = FALSE) # e.g., 1 -> "H1"
) %>%
# Handles any new factor levels that might appear in test/CV data but weren't
# in the training set when the recipe was prepped. It assigns them to a special 'new' level.
step_novel(all_nominal_predictors()) %>% # all_nominal_predictors() already excludes outcomes and ID-role columns
# Removes predictors that have only one unique value (zero variance), as they offer no info.
step_zv(all_predictors()) %>%
# Imputes missing values in numeric columns using the median calculated from the training set.
step_impute_median(all_numeric_predictors()) %>%
# Creates dummy (binary 0/1) variables from all nominal (factor) predictors.
# `one_hot = FALSE` means k-1 dummies are created for a k-level factor, good for models with an intercept.
step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%
# Normalises (centres to mean 0 and scales to std dev 1) all numeric predictors.
# This helps algorithms that are sensitive to feature scales.
step_normalize(all_numeric_predictors()) %>%
# Removes one of a pair of numeric predictors if their absolute correlation is above 0.9.
# This helps reduce multicollinearity.
step_corr(all_numeric_predictors(), threshold = 0.9)
print("Recipe defined. Ready for model training.")
# Prepping the recipe once on the training data.
# This 'prepped_recipe_for_models' object now contains all the information needed
# to apply these exact transformations to any new data (like CV folds or the test set).
prepped_recipe_for_models <- prep(rec, training = train_data)
# I can inspect the prepped recipe to see what it learned, e.g., which columns were dummied or removed.
# summary(prepped_recipe_for_models)
# tidy(prepped_recipe_for_models, number = x) # for step x
} else {
# This part ensures that if there wasn't enough data to begin with,
# the script notes it and later model training steps will be skipped.
print("Not enough data for ML modelling after initial preparation. Skipping recipe creation and model training.")
# Setting these to NULL so later chunks can check if models were trained.
train_data <- test_data <- NULL # No data to split
rec <- prepped_recipe_for_models <- NULL
model_lm <- model_lasso <- model_rf <- model_xgb <- NULL
results <- NULL
}
## [1] "Training data rows: 776"
## [1] "Test data rows: 196"
## [1] "Recipe defined. Ready for model training."
With my data split and a solid preprocessing recipe in place and prepped, I am now all set to actually train and tune the different machine learning models.
With my training data prepared and my preprocessing recipe defined
and prepped, I was ready to train the actual machine learning models. My
strategy here was to use 10-fold cross-validation for
robust performance estimation and for tuning the hyperparameters of the
more complex models. I set this up using caret’s
trainControl() function. I also enabled
savePredictions = "final" which is useful if I want to look
at the out-of-fold predictions later, and
allowParallel = TRUE which can speed up training if I have
a parallel backend like doParallel registered (though for
this script I have not explicitly registered one; caret
might still use available cores on some systems).
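For reference, registering such a backend would look roughly like this. This is not done in this report, and it assumes the doParallel package is installed:

```r
library(doParallel)  # also loads foreach and parallel

# Reserve one core for the OS; always keep at least one worker.
n_workers <- max(1, parallel::detectCores() - 1)
cl <- parallel::makePSOCKcluster(n_workers)
registerDoParallel(cl)

# ... caret::train() calls with allowParallel = TRUE now run CV folds in parallel ...

stopCluster(cl)           # release the worker processes when done
foreach::registerDoSEQ()  # fall back to sequential execution
```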
I decided to train four types of models to compare their performance:
1. Linear Regression (lm): A good, simple
baseline model. 2. Lasso Regression
(glmnet): This is a regularised linear model that
can also perform feature selection by shrinking some coefficients
exactly to zero. I tuned its lambda (regularisation
strength) parameter. 3. Random Forest
(rf): A powerful ensemble method based on decision
trees. I fixed the number of trees (ntree) to 50 (based on
my earlier experimentation where more trees did not offer significant
gains for the added computation) and tuned the mtry
parameter (number of variables randomly sampled at each split) using a
dynamically generated grid. 4. XGBoost
(xgbTree): Another very powerful gradient boosting
ensemble method. I set up a grid (xgb_grid) to tune several
of its key hyperparameters like the number of rounds, tree depth, and
learning rate.
For all models, I used “RMSE” (Root Mean Squared Error) as the primary metric to optimise during tuning, as it is a common and interpretable measure of prediction error for regression tasks.
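As a quick reminder of what caret is optimising here, RMSE is just the square root of the mean squared prediction error, computable directly in base R:

```r
# Hand-computed RMSE on a tiny toy example.
obs  <- c(100, 102, 105)
pred <- c( 98, 103, 104)

rmse <- sqrt(mean((obs - pred)^2))
rmse  # sqrt((4 + 1 + 1) / 3) = sqrt(2), about 1.414
```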
The R code chunk below runs the training for all these models. This
can take a bit of time, especially for Random Forest and XGBoost with
hyperparameter tuning. So, in this R Markdown file, I have set the chunk
options results='hide' to prevent all the verbose training
logs from caret from cluttering the final report, and
cache=TRUE. The cache=TRUE option is really
useful because R Markdown will save the results of this chunk (the
trained models) and will only re-run the training if the code in this
chunk or the data it depends on actually changes. This saves a lot of
time when I re-knit the document. I will print out summaries of the
models and their performance in the subsequent sections.
# This chunk will run the model training.
# Output logs are hidden in the R Markdown report due to 'results='hide''.
# 'cache=TRUE' will save computation time on subsequent knits if code/data unchanged.
# Checking if I have enough data and if the recipe was prepped before trying to train models.
# This 'if' block should ideally not be strictly necessary here if the Rmd is knitted top-to-bottom,
# as the objects would exist from the previous chunk. However, it adds robustness if running chunks interactively.
if (exists("ml_data_final") && nrow(ml_data_final) > 100 &&
exists("prepped_recipe_for_models") && !is.null(prepped_recipe_for_models) &&
exists("train_data") && !is.null(train_data) ) {
# Defining my cross-validation strategy using caret's trainControl.
ctrl <- trainControl(
method = "cv", # Using k-fold Cross-validation.
number = 10, # Using 10 folds.
savePredictions = "final", # Saves out-of-fold predictions for each resample (useful for some analyses).
allowParallel = TRUE, # Allows parallel processing if a backend like 'doParallel' is registered.
search = "grid" # Using grid search when a tuneGrid is supplied.
)
# --- Linear Regression (LM) ---
# My first model, a standard linear regression. It serves as a good baseline.
# Caret uses the 'prepped_recipe_for_models' to preprocess data automatically during training and CV.
print("Training Linear Model...")
model_lm <- train(prepped_recipe_for_models, # Using the prepped recipe
data = train_data,
method = "lm",
trControl = ctrl,
metric = "RMSE") # Optimising for RMSE.
# --- Lasso Regression (glmnet) ---
# Lasso performs L1 regularisation, which can also do feature selection.
# I am only tuning 'lambda' here (alpha=1 for Lasso).
print("Training Lasso Model...")
lasso_grid <- expand.grid(alpha = 1, lambda = 10^seq(-4, -1, length.out = 10)) # Defining a grid of 10 lambda values.
model_lasso <- train(prepped_recipe_for_models,
data = train_data,
method = "glmnet",
tuneGrid = lasso_grid,
trControl = ctrl,
metric = "RMSE")
# --- Random Forest (rf) ---
# An ensemble method that is usually quite powerful.
# I have set ntree=50 based on my earlier experiments.
# Caret will tune 'mtry' (number of variables randomly sampled at each split).
print("Training Random Forest Model...")
# Dynamically creating a sensible mtry grid based on the actual number of predictors
# *after* the recipe has been preprocessed on the training data. This is important because
# steps like dummy variable creation or correlation removal can change the number of predictors.
temp_juiced_train_rf <- juice(prepped_recipe_for_models) # Getting the processed training data.
# Number of predictors is total columns minus 1 (for the outcome variable 'rent_cpi').
num_actual_predictors_rf <- ncol(temp_juiced_train_rf) - 1
# Defining a set of mtry candidates. These are common heuristics.
mtry_candidates_rf <- unique(floor(c(num_actual_predictors_rf/3, sqrt(num_actual_predictors_rf),
num_actual_predictors_rf*0.2, num_actual_predictors_rf*0.5)))
# Ensuring candidates are valid (greater than 0 and not more than total predictors).
mtry_candidates_rf <- mtry_candidates_rf[mtry_candidates_rf > 0 & mtry_candidates_rf <= num_actual_predictors_rf]
if(length(mtry_candidates_rf) == 0) mtry_candidates_rf <- max(1, floor(num_actual_predictors_rf/3)) # Fallback
if(any(is.na(mtry_candidates_rf)) || length(mtry_candidates_rf) == 0) mtry_candidates_rf <- 2 # Absolute fallback
rf_tuning_grid_final <- expand.grid(.mtry = unique(sort(mtry_candidates_rf[mtry_candidates_rf>0])))
# print("Random Forest mtry tuning grid to be used:"); print(rf_tuning_grid_final) # Can uncomment to see the grid
model_rf <- train(prepped_recipe_for_models,
data = train_data,
method = "rf",
ntree = 50, # Fixed number of trees.
tuneGrid = rf_tuning_grid_final, # Using my defined mtry grid.
trControl = ctrl,
metric = "RMSE",
importance = TRUE) # Calculating variable importance measures.
# --- XGBoost (xgbTree) ---
# Another very powerful gradient boosting ensemble. I am tuning several key parameters using a grid search.
# This can be quite computationally intensive, so the grid size is a balance.
print("Training XGBoost Model...")
xgb_grid <- expand.grid(
nrounds = c(100, 300, 500), # Number of boosting rounds (trees).
max_depth = c(3, 6, 9), # Maximum depth of each tree.
eta = c(0.01, 0.1, 0.3), # Learning rate (shrinkage).
gamma = 0, # Minimum loss reduction for a split (kept fixed at 0 for this grid).
colsample_bytree = 0.8, # Fraction of columns sampled per tree (kept fixed).
min_child_weight = 1, # Minimum sum of instance weights in a child node (kept fixed).
subsample = 0.8 # Fraction of training data sampled per tree (kept fixed).
) # This results in 3*3*3 = 27 combinations for caret to test.
model_xgb <- train(prepped_recipe_for_models,
data = train_data,
method = "xgbTree",
tuneGrid = xgb_grid,
trControl = ctrl,
metric = "RMSE",
verbosity = 0) # Suppressing XGBoost's own messages during training.
} else {
print("Skipping model training as pre-requisites (sufficient data or prepped recipe) are not met.")
}
As noted before this chunk, results='hide' keeps the verbose caret training logs out of the report, and cache=TRUE means the trained models are only recomputed when the chunk's code or data changes. I will print model summaries and performance metrics in the next sections.
Now that I have trained my different models (Linear Regression,
Lasso, Random Forest, and XGBoost), the next step is to compare how they
performed during the 10-fold cross-validation process. The
caret package makes this pretty straightforward with the
resamples() function, which collects all the performance
metrics (like RMSE, R-squared, and MAE) from each fold for each
model.
First, I will print a summary table of these resampled metrics. This
gives me a statistical overview (min, median, mean, max) of how each
model did across the different folds. Then, I will use
dotplot() to visualise these distributions, which makes it
easier to quickly compare the models, especially looking at their median
performance and the spread of their results. I am particularly
interested in RMSE (lower is better) and R-squared (higher is
better).
# This chunk assumes that 'model_lm', 'model_lasso', 'model_rf', and 'model_xgb'
# were successfully trained in the previous chunk (ml_train_models_chunk).
# Checking if the model objects exist and are not NULL before proceeding.
if (exists("model_lm") && !is.null(model_lm) &&
exists("model_lasso") && !is.null(model_lasso) &&
exists("model_rf") && !is.null(model_rf) &&
exists("model_xgb") && !is.null(model_xgb)) {
# Collecting all trained model objects into a list for comparison using resamples().
model_list_for_resamples <- list(LM = model_lm, Lasso = model_lasso, RF = model_rf, XGB = model_xgb)
# Getting the resampling results.
results <- resamples(model_list_for_resamples)
print("--- Resamples Summary (Cross-Validation Metrics) ---")
# This summary gives me stats like Min, Median, Mean, Max for each metric (RMSE, Rsquared, MAE)
# across the 10 cross-validation folds for each model.
print(summary(results))
# Now, I will create dot plots to visualise these comparisons.
# It is important to check if the 'results' object and its 'values' component are valid before plotting.
if (!is.null(results) && !is.null(results$values) && nrow(results$values) > 0 && !any(is.na(results$values))) {
print("--- Generating Model Comparison Dot Plots ---")
# For RMSE dotplot
# Using the 'main' argument, passed as a list, to set the title for these lattice-based plots.
# Calling dotplot() directly as it's an S3 method.
tryCatch({
plot_cv_rmse <- dotplot(results,
metric = "RMSE",
main = list(label = "Model Comparison (Cross-Validation RMSE)", cex = 1.1, col = "grey10")
)
if (!is.null(plot_cv_rmse)) {
print(plot_cv_rmse)
} else {
print("RMSE dotplot object was NULL or failed to generate.")
}
}, error = function(e_rmse_plot) {
print(paste("Error generating RMSE dotplot:", e_rmse_plot$message))
print("As a test, you could try running just: dotplot(results, metric='RMSE') in your console.")
})
# For Rsquared dotplot
tryCatch({
plot_cv_rsquared <- dotplot(results,
metric = "Rsquared",
main = list(label = "Model Comparison (Cross-Validation R-squared)", cex = 1.1, col = "grey10")
)
if (!is.null(plot_cv_rsquared)) {
print(plot_cv_rsquared)
} else {
print("Rsquared dotplot object was NULL or failed to generate.")
}
}, error = function(e_rsq_plot) {
print(paste("Error generating Rsquared dotplot:", e_rsq_plot$message))
print("As a test, you could try running just: dotplot(results, metric='Rsquared') in your console.")
})
} else {
# This message appears if there was an issue with the 'results' object itself.
print("No valid metric values for resamples plot, or 'results' object is NULL/NA.")
if(exists("results") && !is.null(results) && !is.null(results$values) && nrow(results$values) > 0 && any(is.na(results$values))){
# This specific check for NAs within results$values is important.
print("NA/NaN values were detected in resamples metrics earlier. This might be why plots cannot be generated.")
print("ACTION: Run warnings() in your R console to investigate specific warnings generated during training.")
}
}
} else {
# This message appears if the models themselves were not trained in the previous step.
print("One or more ML models were not trained, so skipping results summary and comparison plots.")
}
## [1] "--- Resamples Summary (Cross-Validation Metrics) ---"
##
## Call:
## summary.resamples(object = results)
##
## Models: LM, Lasso, RF, XGB
## Number of resamples: 10
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 56.84831 66.88404 71.84790 70.12038 73.85008 78.75437 0
## Lasso 62.34046 66.55180 70.11223 69.94094 72.04063 80.57571 0
## RF 17.45075 21.17442 22.83042 22.78585 25.28783 26.62823 0
## XGB 16.26965 17.15206 17.87772 19.11131 20.26289 24.59136 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 71.78156 79.95613 86.00672 84.34394 88.08922 96.33101 0
## Lasso 73.65286 82.07546 84.18448 84.17320 86.07776 96.60066 0
## RF 26.27526 30.04312 34.93598 33.81747 37.48122 39.86072 0
## XGB 22.83075 25.68027 27.64934 28.64640 31.36303 35.50130 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LM 0.6539824 0.6958971 0.7172894 0.7158083 0.7366345 0.7724865 0
## Lasso 0.6660041 0.7026623 0.7141840 0.7153712 0.7232549 0.7655254 0
## RF 0.9372410 0.9467224 0.9538050 0.9550157 0.9661711 0.9712157 0
## XGB 0.9517004 0.9622868 0.9709383 0.9677934 0.9741979 0.9782992 0
##
## [1] "--- Generating Model Comparison Dot Plots ---"
1. Summary of Cross-Validation Metrics (from Console Output)
The console output provides a statistical summary (Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum) for MAE, RMSE, and R-squared across the 10 resamples (folds) for each model.
a. Mean Absolute Error (MAE):
Insight: XGBoost has the lowest MAE, followed closely by Random Forest. Both tree-based models (RF and XGB) perform substantially better than the linear models (LM and Lasso) in terms of average prediction error.
b. Root Mean Squared Error (RMSE):
Insight: Similar to MAE, XGBoost achieves the lowest RMSE, followed by Random Forest. RMSE penalizes larger errors more heavily than MAE, and XGBoost’s superior performance here suggests it makes fewer large errors compared to RF, and significantly fewer than LM and Lasso.
c. R-squared (R²):
Insight: XGBoost has the highest R-squared, meaning it explains the most variance in the target variable, closely followed by Random Forest. Both RF and XGB explain over 95% of the variance on average, while LM and Lasso explain around 71-72%.
2. Analysis of Model Comparison Dot Plots
The dot plots visually confirm the findings from the summary table and provide insights into the consistency of model performance across the cross-validation folds. Each point on the dot plot represents the metric value for one fold, and the larger dot is typically the median.
a. RMSE Dot Plot:
Insight: The dot plot clearly shows a distinct performance advantage for RF and XGBoost over LM and Lasso in terms of RMSE. XGBoost is visually the best performer with the lowest and most tightly clustered RMSE values.
b. R-squared Dot Plot:
Insight: The dot plot confirms that RF and XGBoost provide a much better fit to the data than LM and Lasso. XGBoost again appears slightly superior to RF, with its R-squared values consistently at the top end.
Overall Conclusion:
Based on all three metrics (MAE, RMSE, and R-squared) from the 10-fold cross-validation:
XGBoost (XGB) consistently emerges as the best-performing model. It has the lowest median and mean MAE and RMSE, and the highest median and mean R-squared. Its performance is also relatively consistent across the cross-validation folds.
Random Forest (RF) is a strong second-best performer, significantly outperforming the linear models and achieving metrics close to XGBoost.
Linear Model (LM) and Lasso Regression perform substantially worse than RF and XGBoost on all metrics. Their error rates are much higher, and they explain significantly less variance in the data. Lasso offers a marginal improvement over the standard Linear Model in some cases but does not bridge the gap to the tree-based ensemble methods.
Therefore, for predictive accuracy and goodness-of-fit on this particular dataset and task, XGBoost is the preferred model, followed closely by Random Forest.
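To keep the comparison in one place, the median cross-validation metrics from the summary table above can be collected into a small table (values transcribed from the printed output and rounded):

```r
# Median 10-fold CV metrics per model, copied from the summary() output above
# and rounded, then sorted by RMSE (lower is better).
cv_medians <- data.frame(
  Model    = c("LM", "Lasso", "RF", "XGB"),
  MAE      = c(71.85, 70.11, 22.83, 17.88),
  RMSE     = c(86.01, 84.18, 34.94, 27.65),
  Rsquared = c(0.717, 0.714, 0.954, 0.971)
)
cv_medians[order(cv_medians$RMSE), ]
```

Sorting by RMSE puts XGB first and LM last, matching the conclusion above.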
Cross-validation gave me a good idea of how my models might perform and helped me tune their hyperparameters. However, the true test of a model’s ability to generalise to new, unseen data is its performance on the test set, which I held back and did not use at all during training or tuning.
For this final evaluation, I will take my trained models (Linear
Regression, Lasso, Random Forest, and XGBoost, all using their tuned
hyperparameters where applicable) and use them to make predictions on
the test_data. I will then calculate the Root Mean Squared
Error (RMSE) for each model to compare their predictive accuracy on this
hold-out data.
I will also print out some key details for each final model, like the summary for the linear model and the best tuning parameters found for Lasso, Random Forest, and XGBoost. This helps document the final model configurations.
# This chunk assumes 'model_list_for_resamples' (containing all trained models)
# and 'test_data' are available from the previous chunks.
# Checking if I have the list of models and the test data available.
if (exists("model_list_for_resamples") &&
!is.null(model_list_for_resamples) &&
length(model_list_for_resamples) > 0 &&
exists("test_data") && !is.null(test_data) && nrow(test_data) > 0) {
print("--- RMSE on Test Set ---")
# Calculating RMSE for each model on the held-out test set.
# I am using yardstick::rmse_vec for this, which is part of the tidymodels ecosystem.
rmse_values <- sapply(model_list_for_resamples, function(mod) {
# Defensive check for a valid model object.
if(is.null(mod) || !("train" %in% class(mod))) {
print(paste("Warning: A model object in list is NULL or not a caret 'train' object for RMSE calculation."))
return(NA_real_) # Returning NA if the model object is not as expected.
}
tryCatch({ # Adding tryCatch in case prediction fails for a specific model on the test set.
predictions <- predict(mod, newdata = test_data) # Making predictions on the test set.
actuals <- test_data$rent_cpi # The true values from the test set.
# Basic checks before calculating RMSE to avoid errors.
if (length(predictions) != length(actuals)) {
print(paste("Prediction length mismatch for a model. Preds:", length(predictions), "Actuals:", length(actuals)))
return(NA_real_)
}
if (all(is.na(predictions))) {
print("All predictions are NA for a model on the test set.")
return(NA_real_)
}
# If there are some NAs in predictions or actuals (though recipe should handle NAs in predictors),
# rmse_vec will calculate RMSE on complete pairs.
if (any(is.na(predictions)) || any(is.na(actuals)) ) {
print("NAs found in predictions or actuals for test set RMSE calculation. RMSE will be calculated on complete pairs.")
}
return(yardstick::rmse_vec(truth = actuals, estimate = predictions))
}, error = function(e) {
print(paste("Error predicting with a model on test set:", e$message));
return(NA_real_)
})
})
print(rmse_values) # Displaying the test set RMSE for each model.
# --- Printing Key Model Summaries / Best Tunes ---
# This helps to document the final configuration of each model.
if(exists("model_lm") && !is.null(model_lm$finalModel)) {
print("--- Final Linear Model Details (trained on full training data) ---")
# The summary for lm shows coefficients, R-squared for the training data, etc.
# It also shows which coefficients were not defined due to singularities (multicollinearity).
print(summary(model_lm$finalModel))
}
if(exists("model_lasso") && !is.null(model_lasso$bestTune)) {
print(paste("--- Best tuning parameters for Lasso: lambda =", round(model_lasso$bestTune$lambda, 5)))
}
if(exists("model_rf") && !is.null(model_rf$bestTune)) {
print(paste("--- Best tuning parameters for Random Forest: mtry =", model_rf$bestTune$mtry))
# Variable importance can be very insightful for Random Forest.
# I can uncomment this to plot it if I want to see it in the report.
# print("Random Forest Variable Importance (Top 10):")
# rf_var_imp <- varImp(model_rf, scale = FALSE)
# print(plot(rf_var_imp, top = 10, main = "Top 10 Important Variables - Random Forest"))
}
if(exists("model_xgb") && !is.null(model_xgb$bestTune)) {
print("--- Best tuning parameters for XGBoost: ---")
print(model_xgb$bestTune)
# XGBoost also provides variable importance.
# print("XGBoost Variable Importance (Top 10):")
# xgb_var_imp <- varImp(model_xgb, scale = FALSE)
# print(plot(xgb_var_imp, top = 10, main = "Top 10 Important Variables - XGBoost"))
}
} else {
print("Model list for resamples not available, or test_data is empty/NULL. Skipping test set evaluation.")
}
## [1] "--- RMSE on Test Set ---"
## LM Lasso RF XGB
## 76.54330 76.36756 28.53961 26.77484
## [1] "--- Final Linear Model Details (trained on full training data) ---"
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -143.734 -68.704 -4.457 53.174 264.002
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1244.0284 3.0117 413.067 < 2e-16 ***
## owner_ratio_national -118.0231 3.0932 -38.156 < 2e-16 ***
## price_rent_cpi_ratio 42.8195 3.1068 13.782 < 2e-16 ***
## price_growth_monthly -2.2454 3.3803 -0.664 0.507
## rent_growth_monthly 13.9697 3.1573 4.425 1.11e-05 ***
## month_Feb 4.3937 4.3275 1.015 0.310
## month_Mar 5.2725 4.4433 1.187 0.236
## month_Apr 3.5661 4.2618 0.837 0.403
## month_May 2.7986 4.2382 0.660 0.509
## month_Jun 3.5484 4.2178 0.841 0.400
## month_Jul -2.6366 4.2421 -0.622 0.534
## month_Aug -1.6050 4.2167 -0.381 0.704
## month_Sep -2.0301 4.1152 -0.493 0.622
## month_Oct -1.0603 4.2323 -0.251 0.802
## month_Nov 0.2852 4.2663 0.067 0.947
## month_Dec -0.3049 4.2419 -0.072 0.943
## month_new NA NA NA NA
## quarter_Q2 NA NA NA NA
## quarter_Q3 NA NA NA NA
## quarter_Q4 NA NA NA NA
## quarter_new NA NA NA NA
## half_year_H2 NA NA NA NA
## half_year_new NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.9 on 760 degrees of freedom
## Multiple R-squared: 0.7228, Adjusted R-squared: 0.7173
## F-statistic: 132.1 on 15 and 760 DF, p-value: < 2.2e-16
##
## [1] "--- Best tuning parameters for Lasso: lambda = 0.1"
## [1] "--- Best tuning parameters for Random Forest: mtry = 12"
## [1] "--- Best tuning parameters for XGBoost: ---"
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 18 500 9 0.1 0 0.8 1 0.8
1. RMSE on Test Set
The R code calculates the RMSE for each model’s predictions on the test_data. The results: LM = 76.54, Lasso = 76.37, RF = 28.54, XGB = 26.77.
Insight:
Consistent with the cross-validation results, the tree-based ensemble
models, XGBoost and Random Forest, significantly outperform the linear
models (LM and Lasso) on the test set.
XGBoost achieves the lowest RMSE (26.77), indicating it had the smallest
prediction errors on average on the unseen test data.
Random Forest is a close second with an RMSE of 28.54.
Lasso Regression (RMSE 76.37) performs marginally better than the
standard Linear Model (RMSE 76.54), but both have substantially higher
errors than RF and XGBoost.
This confirms that the superior performance of XGBoost and Random Forest
observed during cross-validation translates well to unseen data,
suggesting good generalization.
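Those test-set RMSE values can also be read as relative error reductions. A quick back-of-the-envelope calculation, using the values printed above:

```r
# Test-set RMSE values copied from the output above, re-expressed as the
# relative RMSE reduction each model achieves versus the plain linear model.
test_rmse <- c(LM = 76.54330, Lasso = 76.36756, RF = 28.53961, XGB = 26.77484)
reduction_vs_lm <- 1 - test_rmse / test_rmse[["LM"]]
round(100 * reduction_vs_lm, 1)
# XGBoost and Random Forest cut the linear model's test RMSE by roughly 65%
# and 63% respectively, while Lasso's improvement is a fraction of a percent.
```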
2. Final Linear Model (LM) Details
The summary output for the finalModel of the Linear Regression (trained on the full training dataset) provides several details:
Coefficients:
- owner_ratio_national has a strong negative coefficient (-118.02) and is highly statistically significant (p < 2e-16), suggesting that as the national owner-occupied ratio decreases, the predicted Rent CPI increases.
- price_rent_cpi_ratio has a positive coefficient (42.82) and is highly statistically significant (p < 2e-16), indicating that a higher price-to-rent CPI ratio is associated with a higher predicted Rent CPI.
- rent_growth_monthly also has a positive coefficient (13.97) and is highly significant (p = 1.11e-05).
- price_growth_monthly has a small negative coefficient (-2.25) but is not statistically significant (p = 0.507).
- The month_* dummy variables are not statistically significant, suggesting limited monthly seasonality after accounting for other factors.
- Several time-based dummies (month_new, quarter_*, half_year_*) were not defined due to multicollinearity (perfect correlation between predictors), which is common when including multiple related time-based dummy variables.
Goodness-of-Fit (on training data): Multiple R-squared of 0.7228 (Adjusted 0.7173), with a residual standard error of 83.9.
Insight:
The linear model identifies owner_ratio_national,
price_rent_cpi_ratio, and rent_growth_monthly
as significant predictors of Rent CPI in the training data. However, its
overall predictive performance (R-squared on training data and RMSE on
test data) is much weaker than the ensemble methods.
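The singularities reported for the quarter and half-year dummies are worth a quick illustration. The sketch below uses a tiny synthetic dataset (not the project data) to show how lm() returns NA for dummies that are exact linear combinations of others:

```r
# Each quarter indicator is the sum of its three month indicators, so once the
# month dummies are in the model the quarter dummies are perfectly collinear
# (aliased) and lm() reports their coefficients as NA.
set.seed(1)
d <- data.frame(month = factor(rep(month.abb, each = 2), levels = month.abb))
d$quarter <- factor(paste0("Q", (as.integer(d$month) - 1) %/% 3 + 1))
d$y <- rnorm(nrow(d))
fit <- lm(y ~ month + quarter, data = d)
sum(is.na(coef(fit)))  # all three quarter coefficients come back NA
```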
3. Best Tuning Parameters for Tuned Models
The output also provides the optimal hyperparameters found during the cross-validation tuning process for Lasso, Random Forest, and XGBoost:
Lasso: lambda = 0.1 (the regularisation penalty selected during cross-validation).
Random Forest (RF): mtry = 12 (the number of predictors sampled at each split).
XGBoost (XGB):
- nrounds = 500 (number of boosting rounds)
- max_depth = 9 (maximum tree depth)
- eta = 0.1 (learning rate)
- gamma = 0 (minimum loss reduction required to make a further partition)
- colsample_bytree = 0.8 (subsample ratio of columns when constructing each tree)
- min_child_weight = 1 (minimum sum of instance weight needed in a child)
- subsample = 0.8 (subsample ratio of the training instances)
Insight:
These parameters represent the configurations that yielded the best performance for each respective model during the cross-validation tuning phase. These are the configurations used for the final model training and subsequent evaluation on the test set.
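For context, the row label "18" in the bestTune printout hints that the winning configuration was the 18th row of the tuning grid. A grid of the kind below (a hypothetical sketch; the exact grid used in training is not shown in this report) would produce 18 candidates, with the reported best tune as its final row:

```r
# Hypothetical caret-style tuning grid for method = "xgbTree"; the column
# names match caret's xgbTree tuning parameters. expand.grid() varies the
# first column fastest, so row 18 here is nrounds = 500, max_depth = 9, eta = 0.1.
xgb_grid <- expand.grid(
  nrounds          = c(100, 250, 500),  # boosting rounds
  max_depth        = c(3, 6, 9),        # maximum tree depth
  eta              = c(0.05, 0.1),      # learning rate
  gamma            = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample        = 0.8
)
nrow(xgb_grid)  # 18 candidate configurations
```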
Overall Conclusion from Test Set Evaluation:
The test set evaluation confirms the hierarchy of model performance established during cross-validation: XGBoost first, Random Forest a close second, and the linear models (Lasso marginally ahead of LM) well behind.
This robust evaluation on a separate test set provides strong evidence that XGBoost, with its tuned hyperparameters, is the most suitable model among those tested for predicting Rent CPI in this context, offering the best generalization to new data.
While metrics like RMSE give me a good single number to compare models, it is also really useful to see how well the predictions line up with the actual values. For my best performing models (Random Forest and XGBoost, based on the test set RMSE), I decided to create a couple of diagnostic plots:
- An Actual vs. Predicted scatter plot, with the actual rent_cpi values from the test set on one axis and the model's predicted values on the other. If the model is making good predictions, the points should cluster tightly around a diagonal line where Actual = Predicted.
- A residuals-over-time plot, faceted by region, to check whether errors drift or cluster in particular areas or periods.
These visual checks can often reveal patterns or issues that a single error metric might not capture.
# This chunk assumes 'model_rf', 'model_xgb' and 'test_data' exist from previous chunks.
# Plotting for Random Forest, which was one of my top models.
if (exists("model_rf") && !is.null(model_rf) &&
exists("test_data") && !is.null(test_data) && nrow(test_data) > 0) {
print("Plotting Random Forest: Actual vs Predicted, and Residuals, on Test Set...")
# Making predictions with the Random Forest model on the test data.
final_preds_rf <- predict(model_rf, newdata = test_data)
# Creating a dataframe for easy plotting.
plot_df_rf <- data.frame(
Actual = test_data$rent_cpi,
Predicted = as.numeric(final_preds_rf), # Ensuring predictions are numeric.
Date = test_data$date, # Including Date for the residual plot over time.
Region = test_data$region # Including Region for faceting the residual plot.
)
# --- Actual vs. Predicted Scatter Plot for Random Forest ---
pred_vs_actual_plot_rf <- ggplot(plot_df_rf, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.6, colour = "forestgreen", size = 1.5) +
# Adding a y=x line for perfect agreement reference.
geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "black", linewidth = 0.8) +
labs(
title = "Actual vs. Predicted Rent CPI (Random Forest - Test Set)",
x = "Actual Rent CPI",
y = "Predicted Rent CPI"
) +
coord_fixed() # Ensures a 1:1 aspect ratio, making it easier to judge predictions against the diagonal.
print(pred_vs_actual_plot_rf)
# --- Residuals Over Time Plot for Random Forest ---
# Calculating residuals (Actual - Predicted).
plot_df_residuals_rf <- plot_df_rf %>% mutate(Residuals = Actual - Predicted)
residuals_time_plot_rf <- ggplot(plot_df_residuals_rf, aes(x = Date, y = Residuals)) +
geom_line(alpha = 0.7, colour = "purple4") +
# Horizontal line at zero for reference (perfect prediction would have residuals of 0).
geom_hline(yintercept = 0, linetype = "dashed", colour = "black") +
# Faceting by region can show if the model struggles more with certain areas.
# Adjusting ncol to be at most 3, or fewer if fewer unique regions in test_data.
facet_wrap(~Region, scales = "free_y", ncol=min(3, length(unique(plot_df_residuals_rf$Region)))) +
labs(
title = "Residuals Over Time (Random Forest - Test Set)",
x = "Date",
y = "Residuals (Actual - Predicted)"
) +
theme(strip.text = element_text(size=rel(0.8))) # Adjusting facet label size.
print(residuals_time_plot_rf)
} else {
print("Random Forest model ('model_rf') or test_data not available. Skipping its actual vs. predicted plots.")
}
## [1] "Plotting Random Forest: Actual vs Predicted, and Residuals, on Test Set..."
# Optional: Plotting for XGBoost as well, given it was also a strong performer.
if (exists("model_xgb") && !is.null(model_xgb) &&
exists("test_data") && !is.null(test_data) && nrow(test_data) > 0) {
print("Plotting XGBoost Actual vs Predicted on Test Set...")
final_preds_xgb <- predict(model_xgb, newdata = test_data)
plot_df_xgb <- data.frame(
Actual = test_data$rent_cpi,
Predicted = as.numeric(final_preds_xgb)
# Can add Date and Region here too if I want to plot XGBoost residuals like for RF.
# Date = test_data$date,
# Region = test_data$region
)
pred_vs_actual_plot_xgb <- ggplot(plot_df_xgb, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.5, colour = "firebrick") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "black", linewidth = 0.8) +
labs(
title = "Actual vs. Predicted Rent CPI (XGBoost Model - Test Set)",
x = "Actual Rent CPI",
y = "Predicted Rent CPI"
) +
coord_fixed()
print(pred_vs_actual_plot_xgb)
# Example of how I could do residuals for XGBoost if I wanted:
# plot_df_residuals_xgb <- plot_df_xgb %>% mutate(Residuals = Actual - Predicted, Date = test_data$date, Region = test_data$region)
# residuals_time_plot_xgb <- ggplot(plot_df_residuals_xgb, aes(x = Date, y = Residuals)) +
# geom_line(alpha = 0.7, colour = "orangered3") +
# geom_hline(yintercept = 0, linetype = "dashed", colour = "black") +
# facet_wrap(~Region, scales = "free_y", ncol=min(3, length(unique(plot_df_residuals_xgb$Region)))) +
# labs(title = "Residuals Over Time (XGBoost - Test Set)", x = "Date", y = "Residuals") +
# theme(strip.text = element_text(size=rel(0.8)))
# print(residuals_time_plot_xgb)
} else {
print("XGBoost model ('model_xgb') or test_data not available. Skipping its actual vs. predicted plots.")
}
## [1] "Plotting XGBoost Actual vs Predicted on Test Set..."
1. Actual vs. Predicted Rent CPI – Random Forest (Test Set)
The scatter plot shows that Random Forest predictions align well with
actual Rent CPI values, forming a tight cluster around the ideal line.
The model handles the full CPI range (1000–1550) and has minimal bias.
Slight deviations are seen at higher CPI levels (over 1400), where
prediction errors increase mildly.
Insight: Random Forest predicts Rent CPI with high
accuracy overall, showing good alignment with actual values and only
minor issues at higher price levels.
2. Residuals Over Time by Region – Random Forest (Test Set)
The residuals generally fluctuate around zero, indicating that the
model doesn’t suffer from major systematic bias. However, there are
differences across regions and time:
- Auckland: Residuals are fairly stable and mostly
within ±25 CPI units.
- Canterbury: More volatile residuals, with
under-prediction around 2012–2014 and over-prediction around
2017–2018.
- National: Slight under-prediction trend in later
years (2016–2019).
- Rest of North Island: Some under-prediction spikes
around 2019–2020.
- Rest of South Island: Noticeable volatility,
including consistent under-prediction (2011–2012) and a strong spike in
2020.
- Wellington: Earlier predictions are accurate, but
from 2014 onward there’s a visible shift — first over-prediction, then
under-prediction.
Insight: While Random Forest performs well overall, residual behavior varies by region and time. Certain regions—especially Canterbury, Rest of South Island, and Wellington—show patterns of systematic over- or under-prediction during specific periods, indicating uneven regional accuracy.
3. Actual vs. Predicted Rent CPI – XGBoost (Test Set)
XGBoost predictions closely track actual CPI values, forming a tight and consistent pattern along the diagonal. The spread and accuracy are comparable to Random Forest but slightly better in RMSE.
Insight: XGBoost matches Random Forest in visual predictive performance and slightly exceeds it in precision. The model shows strong generalization and robustness with minimal bias across the full CPI range.
Overall Insight:
Both Random Forest and XGBoost deliver high predictive accuracy on unseen test data, with XGBoost marginally outperforming. However, regional and temporal residual analysis reveals that model performance isn’t uniform. Some regions and time periods show deviations, particularly for Random Forest. This suggests potential areas for future model refinement, especially for capturing regional dynamics more accurately.
After building and evaluating my regression models for predicting
Rent CPI, I wanted to explore more traditional time series forecasting
techniques as well. My goal here is to predict Rent CPI out to the end
of 2030. I decided to try two main approaches for a selected region (I
will use “Auckland” as the primary example in this report, but my R
script is set up to loop through other regions too if I change the
unique_regions_to_forecast variable):
1. A direct, univariate ARIMA model fitted to each region's Rent CPI history alone.
2. A recursive forecast with my trained Random Forest model, which first requires projecting its input predictors forward (such as median_price).
First, I need to set up some general parameters for the forecasting horizon and ensure my previously trained models and recipes are ready if I need them.
# --- General Setup for My Forecasting Experiments ---
unique_regions_to_forecast <- unique(full_data$region)
# Defining how far out I want to forecast.
last_historical_date <- max(full_data$date, na.rm = TRUE)
forecast_horizon_start_date <- last_historical_date %m+% months(1) # Starting from the month after my last data point.
forecast_horizon_end_date <- make_date(2030, 12, 1) # My target end date.
future_dates_for_forecast <- seq(from = forecast_horizon_start_date, to = forecast_horizon_end_date, by = "month")
num_forecast_periods <- length(future_dates_for_forecast) # How many months this is.
print(paste("--- Initialising Forecasting for Region(s):", paste(unique_regions_to_forecast, collapse=", "), "---"))
## [1] "--- Initialising Forecasting for Region(s): Auckland, Wellington, Rest of North Island, Canterbury, Rest of South Island, National ---"
print(paste("Forecasting for", num_forecast_periods, "months, from",
format(forecast_horizon_start_date, "%Y-%m"), "to", format(forecast_horizon_end_date, "%Y-%m")))
## [1] "Forecasting for 126 months, from 2020-07 to 2030-12"
# Making sure my prepped recipe (from the ML section) is available, as I will need it for the Random Forest forecast.
if (!exists("prepped_recipe_for_models") || !"recipe" %in% class(prepped_recipe_for_models) || is.null(prepped_recipe_for_models$tr_info)) {
# This check ensures that if I run this section independently, the recipe is prepped.
# It assumes 'rec' and 'train_data' are loaded from the ML section.
if(exists("rec") && exists("train_data")){
print("Note: 'prepped_recipe_for_models' was not found globally, re-prepping from 'rec' and 'train_data'.")
prepped_recipe_for_models <- prep(rec, training = train_data)
} else {
stop("The recipe ('rec') and 'train_data' are not available to prep for forecasting. Please ensure the ML section has run.")
}
}
# Making sure my Random Forest model is loaded and ready.
if (!exists("model_rf") || !("train" %in% class(model_rf))) {
stop("My Random Forest model ('model_rf') is not here. I need to run the ML training part first for RF forecasting.")
}
active_forecasting_model_rf <- model_rf # This is the RF model I plan to use for one of the methods.
# For the recursive RF forecast, I'll need national dwelling ratios.
# I am just going to hold these constant at their last known values for simplicity in these forecasts.
last_national_ratios_fc <- full_data %>%
arrange(desc(date)) %>%
# distinct() on these columns should give one row if they are truly national and consistent per date.
distinct(rented_ratio_national, owner_ratio_national, free_ratio_national) %>%
dplyr::slice(1) %>% # Taking the most recent set of distinct ratios.
select(rented_ratio_national, owner_ratio_national, free_ratio_national)
if(nrow(last_national_ratios_fc) == 0 || any(is.na(last_national_ratios_fc))){
print("Warning: Couldn't get complete last national ratios for forecasting. Using placeholder values for RF method.")
last_national_ratios_fc <- data.frame(rented_ratio_national=0.3, owner_ratio_national=0.6, free_ratio_national=0.1)
}
# Initialising lists to store all forecasts from different methods and regions if I loop later.
all_regional_arima_cpi_forecasts <- list()
all_regional_rf_cpi_forecasts <- list()
For this first forecasting method, I am using the
auto.arima() function from the forecast
package. This function is pretty handy as it automatically tries to
select the best ARIMA(p,d,q)(P,D,Q)[m] model based on information
criteria (like AICc). This approach just looks at the past values of the
Rent CPI for a specific region to make its forecasts. It does not use
any of the other predictors from my full_data table.
My process for each region in unique_regions_to_forecast (this run
loops over all six regions; Auckland's output appears first below)
involves:
1. Extracting the historical Rent CPI for the chosen region.
2. Converting it to a time series (ts) object, which is what auto.arima needs.
3. Fitting the auto.arima model, allowing it to search for seasonality, drift, and apply a Box-Cox transformation if it helps.
4. Checking the residuals of the fitted model to make sure it is a decent fit (i.e., residuals should ideally look like white noise).
5. Generating the forecast for the desired number of periods (up to the end of 2030).
6. Plotting the forecast, including the 80% and 95% prediction intervals to show the uncertainty.
# This R chunk now contains its own loop for Method 1.
# It will iterate through each region in 'unique_regions_to_forecast'.
# Initialising a list to store ARIMA forecast objects if I run for multiple regions
# Note: all_regional_arima_cpi_forecasts was already initialised in the general setup chunk.
# if (!exists("all_regional_arima_cpi_forecasts")) all_regional_arima_cpi_forecasts <- list()
for (current_forecast_region in unique_regions_to_forecast) {
print(paste0("\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: ", current_forecast_region, " ==="))
regional_rent_cpi_history <- full_data %>%
filter(region == current_forecast_region & !is.na(rent_cpi)) %>%
select(date, rent_cpi) %>%
arrange(date)
regional_arima_cpi_forecast_obj_loop <- NULL
if(nrow(regional_rent_cpi_history) < 36) {
print(paste("Not enough historical Rent CPI data for ARIMA for region:", current_forecast_region, "- Skipping."))
all_regional_arima_cpi_forecasts[[current_forecast_region]] <- NULL
next # Move to the next region in the loop
}
start_year_arima_cpi <- lubridate::year(min(regional_rent_cpi_history$date))
start_month_arima_cpi <- lubridate::month(min(regional_rent_cpi_history$date))
ts_regional_cpi_direct <- ts(regional_rent_cpi_history$rent_cpi,
start = c(start_year_arima_cpi, start_month_arima_cpi),
frequency = 12)
plot_ts_object <- autoplot(ts_regional_cpi_direct) +
ggtitle(paste("Monthly Rent CPI -", current_forecast_region)) +
xlab("Year") + ylab("Rent CPI") +
theme(plot.title = element_text(hjust = 0.5))
print(plot_ts_object)
print(paste("Fitting auto.arima for Rent CPI in", current_forecast_region, "..."))
regional_arima_cpi_model_direct <- NULL
tryCatch({
regional_arima_cpi_model_direct <- forecast::auto.arima(ts_regional_cpi_direct,
seasonal = TRUE,
stepwise = FALSE, approximation = FALSE,
allowdrift = TRUE,
lambda = "auto")
# Note: $method is only populated on forecast objects, not on the fitted Arima
# model itself, so the model specification is printed via summary() just below.
print(paste("ARIMA model fitted for Rent CPI in", current_forecast_region))
print(summary(regional_arima_cpi_model_direct))
print(paste("Residual diagnostics for Rent CPI ARIMA model of", current_forecast_region, ":"))
forecast::checkresiduals(regional_arima_cpi_model_direct)
regional_arima_cpi_forecast_obj_loop <- forecast::forecast(regional_arima_cpi_model_direct, h = num_forecast_periods)
all_regional_arima_cpi_forecasts[[current_forecast_region]] <- tibble(
date = future_dates_for_forecast,
region = current_forecast_region,
rent_cpi_forecast_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$mean),
lower80_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$lower[,1]),
upper80_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$upper[,1]),
lower95_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$lower[,2]),
upper95_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$upper[,2])
)
plot_arima_direct_fc <- autoplot(regional_arima_cpi_forecast_obj_loop) +
ggtitle(paste0(num_forecast_periods/12, "-Year Rent CPI Forecast for ", current_forecast_region, " (ARIMA)"),
subtitle = paste("Model:", regional_arima_cpi_forecast_obj_loop$method)) + # forecast objects carry $method
xlab("Year") + ylab("Rent CPI Forecast") +
guides(fill=guide_legend(title="Prediction Interval:")) +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
print(plot_arima_direct_fc)
print(paste("ARIMA Forecasted values (first 6 months) for Rent CPI in", current_forecast_region, ":"))
if(!is.null(regional_arima_cpi_forecast_obj_loop)) print(head(regional_arima_cpi_forecast_obj_loop$mean))
}, error = function(e_arima_cpi) {
print(paste("Direct ARIMA modelling for Rent CPI failed for", current_forecast_region, ":", e_arima_cpi$message))
all_regional_arima_cpi_forecasts[[current_forecast_region]] <- NULL
})
} # End of the 'for' loop for current_forecast_region (for Method 1)
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Auckland ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.
## [1] "Fitting auto.arima for Rent CPI in Auckland ..."
## [1] "ARIMA model fitted for Rent CPI in Auckland : "
## Series: ts_regional_cpi_direct
## ARIMA(2,1,0)(1,0,0)[12] with drift
## Box Cox transformation: lambda= 0.789686
##
## Coefficients:
## ar1 ar2 sar1 drift
## -0.3077 -0.2904 0.2328 0.6876
## s.e. 0.0773 0.0774 0.0862 0.1497
##
## sigma^2 = 5.766: log likelihood = -367.91
## AIC=745.81 AICc=746.2 BIC=761.22
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.02772669 10.68066 8.632323 -0.001775353 0.6794153 0.2126537 -0.0244276
## [1] "Residual diagnostics for Rent CPI ARIMA model of Auckland :"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)(1,0,0)[12] with drift
## Q* = 22.651, df = 21, p-value = 0.3629
##
## Model df: 3. Total lags used: 24
## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Auckland :"
## Jul Aug Sep Oct Nov Dec
## 2020 1513.793 1518.572 1515.413 1519.628 1530.212 1532.353
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Wellington ==="
## [1] "Fitting auto.arima for Rent CPI in Wellington ..."
## [1] "ARIMA model fitted for Rent CPI in Wellington : "
## Series: ts_regional_cpi_direct
## ARIMA(0,1,1)(0,1,2)[12]
## Box Cox transformation: lambda= -0.8999268
##
## Coefficients:
## ma1 sma1 sma2
## -0.4992 -0.5755 -0.1749
## s.e. 0.0692 0.1020 0.0987
##
## sigma^2 = 1.1e-07: log likelihood = 1385.71
## AIC=-2763.41 AICc=-2763.14 BIC=-2751.4
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.773162 46.95124 20.49915 0.08885825 1.740393 0.4659603 0.1495763
## [1] "Residual diagnostics for Rent CPI ARIMA model of Wellington :"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,1,2)[12]
## Q* = 11.565, df = 21, p-value = 0.9506
##
## Model df: 3. Total lags used: 24
## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Wellington :"
## Jul Aug Sep Oct Nov Dec
## 2020 1553.909 1566.575 1569.266 1580.762 1609.558 1594.449
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Rest of North Island ==="
## [1] "Fitting auto.arima for Rent CPI in Rest of North Island ..."
## [1] "ARIMA model fitted for Rent CPI in Rest of North Island : "
## Series: ts_regional_cpi_direct
## ARIMA(2,1,0) with drift
## Box Cox transformation: lambda= -0.8999268
##
## Coefficients:
## ar1 ar2 drift
## -0.4405 -0.1913 0e+00
## s.e. 0.0872 0.0814 1e-04
##
## sigma^2 = 7.95e-09: log likelihood = 1585.65
## AIC=-3163.31 AICc=-3163.05 BIC=-3150.98
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.716329 30.16879 9.516836 0.2392734 0.8159561 0.223016 0.02392561
## [1] "Residual diagnostics for Rent CPI ARIMA model of Rest of North Island :"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0) with drift
## Q* = 0.81089, df = 22, p-value = 1
##
## Model df: 2. Total lags used: 24
## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Rest of North Island :"
## Jul Aug Sep Oct Nov Dec
## 2020 1618.724 1623.700 1630.576 1636.436 1642.448 1648.660
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Canterbury ==="
## [1] "Fitting auto.arima for Rent CPI in Canterbury ..."
## [1] "ARIMA model fitted for Rent CPI in Canterbury : "
## Series: ts_regional_cpi_direct
## ARIMA(1,1,0)(1,0,0)[12] with drift
## Box Cox transformation: lambda= 0.9358105
##
## Coefficients:
## ar1 sar1 drift
## -0.4158 0.3046 1.430
## s.e. 0.0733 0.0812 0.885
##
## sigma^2 = 132.2: log likelihood = -620.81
## AIC=1249.61 AICc=1249.87 BIC=1261.94
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.07469888 17.99377 13.75772 -0.007292928 1.086328 0.2930917 -0.0247203
## [1] "Residual diagnostics for Rent CPI ARIMA model of Canterbury :"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,0)(1,0,0)[12] with drift
## Q* = 30.209, df = 22, p-value = 0.1135
##
## Model df: 2. Total lags used: 24
## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Canterbury :"
## Jul Aug Sep Oct Nov Dec
## 2020 1388.796 1391.117 1401.172 1386.604 1407.331 1409.722
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Rest of South Island ==="
## [1] "Fitting auto.arima for Rent CPI in Rest of South Island ..."
## [1] "ARIMA model fitted for Rent CPI in Rest of South Island : "
## Series: ts_regional_cpi_direct
## ARIMA(0,1,1)(1,0,0)[12]
## Box Cox transformation: lambda= -0.6768916
##
## Coefficients:
## ma1 sar1
## -0.6015 0.6269
## s.e. 0.0719 0.0604
##
## sigma^2 = 4.287e-08: log likelihood = 1165.43
## AIC=-2324.86 AICc=-2324.71 BIC=-2315.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 2.236949 29.5682 20.21368 0.1786773 1.658088 0.5005038 0.01989861
## [1] "Residual diagnostics for Rent CPI ARIMA model of Rest of South Island :"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,0,0)[12]
## Q* = 13.782, df = 22, p-value = 0.909
##
## Model df: 2. Total lags used: 24
## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Rest of South Island :"
## Jul Aug Sep Oct Nov Dec
## 2020 1415.177 1418.078 1431.342 1412.851 1424.438 1409.355
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: National ==="
## [1] "Fitting auto.arima for Rent CPI in National ..."
## [1] "ARIMA model fitted for Rent CPI in National : "
## Series: ts_regional_cpi_direct
## ARIMA(0,1,1)(1,0,0)[12] with drift
## Box Cox transformation: lambda= -0.101685
##
## Coefficients:
## ma1 sar1 drift
## -0.2623 0.5669 0.0011
## s.e. 0.0761 0.0674 0.0004
##
## sigma^2 = 8.944e-06: log likelihood = 707.9
## AIC=-1407.79 AICc=-1407.54 BIC=-1395.47
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.1265951 7.771131 5.86455 -0.008092995 0.4700289 0.1529883 -0.006882079
## [1] "Residual diagnostics for Rent CPI ARIMA model of National :"
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,0,0)[12] with drift
## Q* = 19.179, df = 22, p-value = 0.6342
##
## Model df: 2. Total lags used: 24
## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in National :"
## Jul Aug Sep Oct Nov Dec
## 2020 1502.499 1510.330 1514.173 1510.031 1532.657 1528.003
Auckland: Forecast rises to ~1900–2000; uncertainty increases over time. Insight: captures the strong trend; future values are uncertain but growth is expected.
Canterbury: Modest rise to ~1600–1700; wide prediction intervals. Insight: the forecast suggests slow recovery; seasonal effects may be under-modeled.
National: Consistent growth forecast toward ~2000; expanding uncertainty. Insight: the forecast aligns with the historical trend; generally reliable.
Rest of North Island: Forecast surges beyond 2800; aggressive trend continuation. Insight: very steep forecast; it reflects the trend but lacks seasonality, so use with caution.
Rest of South Island: Forecast remains flat at ~1300–1350; wide intervals. Insight: the recent drop heavily influences the trend; the forecast is highly uncertain and potentially unreliable.
Wellington: Forecast exceeds 2500; captures the cyclical trend with a good fit. Insight: strong growth projection with seasonality modeled; solid statistical foundation.
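The Ljung-Box results printed in the console output above can be reproduced directly with base R's Box.test(), which is what forecast::checkresiduals() calls under the hood. This is a minimal sketch on simulated white-noise residuals (hypothetical data, not the fitted models above):

```r
# Ljung-Box test on simulated "residuals". fitdf subtracts the number of
# estimated ARMA coefficients, matching the "Model df" line in the
# console output (df = total lags used - model df).
set.seed(42)
fake_residuals <- rnorm(200)
lb <- Box.test(fake_residuals, lag = 24, type = "Ljung-Box", fitdf = 3)
lb$parameter  # degrees of freedom: 24 - 3 = 21
lb$p.value    # large p-value => no evidence of leftover autocorrelation
```

A large p-value (as in all six regions above) means the test finds no evidence of remaining autocorrelation in the residuals, so the ARIMA model has adequately captured the series' structure.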
My second approach to forecasting Rent CPI uses the Random Forest
model (model_rf) that performed well in the machine
learning section. This method is a bit more complex than the direct
ARIMA on Rent CPI because my Random Forest model relies on several other
predictor variables. To use it for forecasting, I need to have future
values for all these predictors.
The main steps for this method, applied to each region in
unique_regions_to_forecast (all six regions in this run, as the output
below shows), are:
Forecast Key Exogenous Predictors: The most
important predictor that itself changes significantly over time is
median_price. So, for the current region, I will first
create an ARIMA forecast for its median_price out to 2030.
I decided to use a log transformation (lambda = 0) for this
median_price forecast, as that often helps produce more
stable and sensible long-term trends for price data. Other predictors,
like the national dwelling ratios, I will hold constant at their last
known values for this forecasting exercise as a simplification.
Recursive Prediction for Rent CPI: Once I have
the future path for median_price (and assumptions for other
external predictors), I will predict rent_cpi one month at
a time using my trained Random Forest model: first I predict
rent_cpi for that month; that predicted rent_cpi is then used
to build the lagged features for the next month's prediction, and
so on. This step-by-step process is why it is called a "recursive"
forecast.
Approximation for price_rent_cpi_ratio: One of the features my
Random Forest model was trained on is
price_rent_cpi_ratio = median_price / rent_cpi. When I am
forecasting for a future month, I do not yet know the
rent_cpi for that exact month to calculate this ratio. So,
for the purpose of creating inputs for the forecast, I will approximate
this feature by using the previous month’s
rent_cpi (which will be either the last known actual value
or the value I just forecasted in the previous step) in the denominator:
price_rent_cpi_ratio_approx = median_price_future / rent_cpi_previous_month.
This is a workaround for recursive forecasting when such ratios are used
in the model. It is an approximation, and for an even more robust
approach in future work, I might consider retraining my Random Forest
model with a feature that is already based on lagged CPI, like
median_price / lag(rent_cpi, 1).
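As a sketch of that suggested retraining feature, here is how a ratio built from lagged rent_cpi could be constructed. The toy data frame is hypothetical; the real pipeline would use dplyr::lag() inside a grouped mutate on full_data:

```r
# A price-to-rent ratio built from *lagged* rent_cpi: the ratio is then
# fully known at forecast time, so no in-loop approximation is needed.
# 'toy_region' stands in for one region's slice of full_data.
toy_region <- data.frame(
  median_price = c(800000, 810000, 825000, 830000),
  rent_cpi     = c(1500, 1505, 1512, 1520)
)
# Base-R equivalent of dplyr::lag(rent_cpi, 1):
rent_cpi_lag1 <- c(NA, head(toy_region$rent_cpi, -1))
toy_region$price_rent_cpi_ratio_lagged <-
  toy_region$median_price / rent_cpi_lag1
toy_region$price_rent_cpi_ratio_lagged  # NA for the first row, then 810000/1500 = 540, ...
```

The first row is NA because no earlier rent_cpi exists, which mirrors how the other lagged features in this project behave.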
Now, here is the R code that implements this recursive Random Forest forecasting method.
# This R chunk contains its own loop for Method 2 (Recursive RF).
# It will iterate through each region in 'unique_regions_to_forecast'.
for (current_forecast_region in unique_regions_to_forecast) {
print(paste0("\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in ", current_forecast_region, " ==="))
# --- Step 2a: Forecast 'median_price' for current_forecast_region using ARIMA ---
# This forecasted median_price will be an input to my RF model for Rent CPI.
print(paste("--- Sub-step for Method 2: Forecasting median_price for", current_forecast_region, "using ARIMA ---"))
historical_median_price_data_rf <- full_data %>%
filter(region == current_forecast_region & !is.na(median_price)) %>%
select(date, median_price) %>%
arrange(date)
future_median_prices_for_rf_loop <- NULL # Initialise this for each region within the loop
if(nrow(historical_median_price_data_rf) < 24) { # Need at least a couple of years for a decent ARIMA
print(paste("Warning: Not enough historical median_price data for", current_forecast_region,
"to fit ARIMA. Holding last known price constant for RF input."))
if(nrow(historical_median_price_data_rf) > 0) {
future_median_prices_for_rf_loop <- rep(tail(historical_median_price_data_rf$median_price, 1), num_forecast_periods)
} else {
# If absolutely no historical median price, RF forecasting for this region will be problematic.
future_median_prices_for_rf_loop <- rep(NA, num_forecast_periods)
print(paste("CRITICAL WARNING: No median price data at all for", current_forecast_region,
". RF forecast's median_price input will be NA, which will likely cause issues."))
}
} else {
median_price_ts_rf <- ts(historical_median_price_data_rf$median_price,
start = c(year(min(historical_median_price_data_rf$date)),
month(min(historical_median_price_data_rf$date))),
frequency = 12)
arima_median_price_model_rf_loop <- NULL # Initialise model object for this region
tryCatch({
# Using lambda=0 (log transform) for median_price as discussed,
# to try and get a more sensible long-term trend for this volatile series.
arima_median_price_model_rf_loop <- forecast::auto.arima(median_price_ts_rf,
seasonal = TRUE, stepwise = TRUE, approximation = FALSE,
allowdrift = TRUE, lambda = 0)
future_median_prices_forecast_obj_rf_loop <- forecast::forecast(arima_median_price_model_rf_loop, h = num_forecast_periods)
future_median_prices_for_rf_loop <- as.numeric(future_median_prices_forecast_obj_rf_loop$mean)
print(paste("ARIMA for median_price (input for RF) in", current_forecast_region, "fitted:", arima_median_price_model_rf_loop$method))
# I can plot this helper forecast to see what median_price projection is being used.
plot_median_price_fc_rf <- autoplot(future_median_prices_forecast_obj_rf_loop) +
ggtitle(paste("Helper: Median Price ARIMA Forecast for", current_forecast_region,"(for RF)"),
subtitle = paste("Model:", arima_median_price_model_rf_loop$method)) +
scale_y_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) +
xlab("Year") + ylab("Median Price Forecast")
print(plot_median_price_fc_rf)
}, error = function(e_arima_mp) {
print(paste("ARIMA for median_price input (for RF) failed for", current_forecast_region, ":", e_arima_mp$message))
# Fallback if ARIMA for median price fails.
if(nrow(historical_median_price_data_rf) > 0) {
# Using <<- to assign to variable in parent environment of the tryCatch error function.
future_median_prices_for_rf_loop <<- rep(tail(historical_median_price_data_rf$median_price, 1), num_forecast_periods)
} else {
future_median_prices_for_rf_loop <<- rep(NA, num_forecast_periods)
}
})
}
# Proceed with RF forecast for this region only if I have future median prices.
if(is.null(future_median_prices_for_rf_loop) || all(is.na(future_median_prices_for_rf_loop))) {
print(paste("Cannot proceed with RF forecast for", current_forecast_region, "due to missing/failed future median price forecast."))
all_regional_rf_cpi_forecasts[[current_forecast_region]] <- NULL # Store NULL if skipped
} else {
# --- 2b. Prepare Last Known Actual Data for current_forecast_region for RF ---
# This is needed to bootstrap the recursive process (lags for the first forecast step).
last_observation_rf_startup <- full_data %>%
filter(region == current_forecast_region) %>%
filter(date == max(date, na.rm = TRUE)) %>%
# Selecting columns that match ml_data_final structure, plus original lags from full_data.
select(any_of(c(colnames(ml_data_final), "rent_cpi_lag", "median_price_lag", "price_growth_monthly", "rent_growth_monthly"))) %>%
# Re-ensure date components are present as they were in ml_data_final (used by recipe).
mutate(
year = lubridate::year(date),
month = lubridate::month(date),
quarter = lubridate::quarter(date),
half_year = ifelse(month %in% 1:6, 1, 2)
)
if(nrow(last_observation_rf_startup) == 0) {
print(paste("No final historical data row found for RF for", current_forecast_region, ". Skipping RF forecast for this region."))
all_regional_rf_cpi_forecasts[[current_forecast_region]] <- NULL
} else {
# Initialise values for the loop based on the true last observation (time t)
rent_cpi_current_is_previous_for_next <- last_observation_rf_startup$rent_cpi
# and the observation before that (time t-1) for growth calculations
rent_cpi_previous_is_two_ago_for_next <- last_observation_rf_startup$rent_cpi_lag
median_price_current_is_previous_for_next <- last_observation_rf_startup$median_price
rent_cpi_forecasts_rf_list_region <- list() # To store this region's RF forecasts
for (i in 1:num_forecast_periods) {
current_future_date_rf <- future_dates_for_forecast[i] # This is the date for t+i
# --- Constructing features for this future month (t+i) ---
year_fut_rf = lubridate::year(current_future_date_rf)
month_fut_rf = lubridate::month(current_future_date_rf) # Will be factorised by recipe
quarter_fut_rf = lubridate::quarter(current_future_date_rf) # Will be factorised by recipe
half_year_fut_rf = ifelse(month_fut_rf %in% 1:6, 1, 2) # Will be factorised by recipe
median_price_fut_rf = future_median_prices_for_rf_loop[i] # This is the ARIMA-forecasted median_price for t+i
log_median_price_fut_rf = ifelse(!is.na(median_price_fut_rf) & median_price_fut_rf > 0, log(median_price_fut_rf), NA)
# Price growth for t+i: (price(t+i) - price(t+i-1)) / price(t+i-1)
# price(t+i-1) is median_price_current_is_previous_for_next
price_growth_monthly_fut_rf <- ifelse(!is.na(median_price_current_is_previous_for_next) & median_price_current_is_previous_for_next != 0 & !is.na(median_price_fut_rf),
(median_price_fut_rf - median_price_current_is_previous_for_next) / median_price_current_is_previous_for_next, NA)
# Fallback for the very first forecast step if the immediate lag needed for growth was NA from history
if (i == 1 && is.na(price_growth_monthly_fut_rf) && !is.na(last_observation_rf_startup$price_growth_monthly)) {
price_growth_monthly_fut_rf <- last_observation_rf_startup$price_growth_monthly
}
# National dwelling ratios (held constant as per last_national_ratios_fc)
rented_ratio_national_fut_rf <- last_national_ratios_fc$rented_ratio_national
owner_ratio_national_fut_rf <- last_national_ratios_fc$owner_ratio_national
free_ratio_national_fut_rf <- last_national_ratios_fc$free_ratio_national
# Rent growth for t+i: (rent_cpi(t+i-1) - rent_cpi(t+i-2)) / rent_cpi(t+i-2)
# rent_cpi(t+i-1) is rent_cpi_current_is_previous_for_next
# rent_cpi(t+i-2) is rent_cpi_previous_is_two_ago_for_next
rent_growth_monthly_fut_rf <- ifelse(!is.na(rent_cpi_current_is_previous_for_next) & !is.na(rent_cpi_previous_is_two_ago_for_next) & rent_cpi_previous_is_two_ago_for_next != 0,
(rent_cpi_current_is_previous_for_next - rent_cpi_previous_is_two_ago_for_next) / rent_cpi_previous_is_two_ago_for_next, NA)
# Fallback for the very first forecast step
if (i == 1 && is.na(rent_growth_monthly_fut_rf) && !is.na(last_observation_rf_startup$rent_growth_monthly) ) {
rent_growth_monthly_fut_rf <- last_observation_rf_startup$rent_growth_monthly
}
# Approximated price_rent_cpi_ratio using t-1 rent_cpi (rent_cpi_current_is_previous_for_next)
# My model was trained with contemporaneous rent_cpi in this ratio. This is an approximation for forecasting.
price_rent_cpi_ratio_fut_approx_rf <- ifelse(rent_cpi_current_is_previous_for_next > 0 & !is.na(median_price_fut_rf),
median_price_fut_rf / rent_cpi_current_is_previous_for_next, NA)
# Assembling the feature row. Must match the structure the recipe expects.
future_row_for_rf_predict <- tibble(
date = current_future_date_rf, # For recipe's ID role (original date)
region = current_forecast_region, # For recipe's ID role
# Core predictors that were in ml_feature_cols
median_price = median_price_fut_rf,
rented_ratio_national = rented_ratio_national_fut_rf,
owner_ratio_national = owner_ratio_national_fut_rf,
free_ratio_national = free_ratio_national_fut_rf,
price_rent_cpi_ratio = price_rent_cpi_ratio_fut_approx_rf,
log_median_price = log_median_price_fut_rf,
price_growth_monthly = price_growth_monthly_fut_rf,
rent_growth_monthly = rent_growth_monthly_fut_rf,
# Date components that the recipe will factorise and dummy
year = year_fut_rf,
month = month_fut_rf,
quarter = quarter_fut_rf,
half_year = half_year_fut_rf
# rent_cpi (outcome) is not included here for prediction
)
# Predicting rent_cpi using my trained RF model.
# Caret's predict() on a 'train' object automatically applies the prepped recipe.
predicted_rent_cpi_value_rf <- predict(active_forecasting_model_rf, newdata = future_row_for_rf_predict)
# Storing the prediction for this month
current_forecast_row_rf <- tibble(date = current_future_date_rf,
region = current_forecast_region,
rent_cpi_forecast = as.numeric(predicted_rent_cpi_value_rf))
rent_cpi_forecasts_rf_list_region[[i]] <- current_forecast_row_rf
# Updating values for the next iteration's lags
rent_cpi_previous_is_two_ago_for_next <- rent_cpi_current_is_previous_for_next
rent_cpi_current_is_previous_for_next <- as.numeric(predicted_rent_cpi_value_rf)
median_price_current_is_previous_for_next <- median_price_fut_rf
} # End of inner forecasting loop (for i in 1:num_forecast_periods) for RF for one region
# Storing the complete RF forecast for the current_forecast_region in my main list
all_regional_rf_cpi_forecasts[[current_forecast_region]] <- bind_rows(rent_cpi_forecasts_rf_list_region)
print(paste("Recursive RF forecasting complete for", current_forecast_region))
# Plotting the RF forecast for the current region
historical_rent_cpi_for_plot_rf <- ml_data_final %>% filter(region == current_forecast_region) %>% select(date, rent_cpi)
current_rf_plot_data <- bind_rows(
historical_rent_cpi_for_plot_rf %>% rename(value = rent_cpi) %>% mutate(type = "Actual"),
all_regional_rf_cpi_forecasts[[current_forecast_region]] %>% rename(value = rent_cpi_forecast) %>% mutate(type = "Forecast (RF Recursive)")
)
if(nrow(current_rf_plot_data) > 0) {
rf_plot <- ggplot(current_rf_plot_data, aes(x = date, y = value, color = type, group = type)) +
geom_line(linewidth = 1) +
scale_colour_manual(values = c("Actual" = "navyblue", "Forecast (RF Recursive)" = "darkgoldenrod1"), name = "Data Type:") +
scale_y_continuous(labels = comma) +
labs(title = paste("Rent CPI RF Forecast:", current_forecast_region),
subtitle = "Recursive using ARIMA for Median Price input", y = "Rent CPI") +
theme(legend.position = "top")
print(rf_plot)
}
} # End else for last_observation_rf_startup check
} # End else for future_median_prices_for_rf_loop check
} # --- END OF THE 'for (current_forecast_region in unique_regions_to_forecast)' LOOP ---
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Auckland ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Auckland using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Auckland fitted: "
## [1] "Recursive RF forecasting complete for Auckland"
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Wellington ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Wellington using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Wellington fitted: "
## [1] "Recursive RF forecasting complete for Wellington"
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Rest of North Island ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Rest of North Island using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Rest of North Island fitted: "
## [1] "Recursive RF forecasting complete for Rest of North Island"
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Canterbury ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Canterbury using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Canterbury fitted: "
## [1] "Recursive RF forecasting complete for Canterbury"
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Rest of South Island ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Rest of South Island using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Rest of South Island fitted: "
## [1] "Recursive RF forecasting complete for Rest of South Island"
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in National ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for National using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in National fitted: "
## [1] "Recursive RF forecasting complete for National"
# Now that the loop has finished for all specified regions, I can proceed to the combined plot.
print("--- All Regional Forecasting Attempts (Method 1 & 2) Completed ---")
## [1] "--- All Regional Forecasting Attempts (Method 1 & 2) Completed ---"
1. Auckland:
The ARIMA model (console previously indicated ARIMA(0,1,1)(0,1,1)[12]
with drift) projects Auckland’s median price to increase from its early
2020 level (~$900K-$1M) at a subdued rate, reaching roughly $1.1M by
2030.
Uncertainty: Prediction intervals are very wide.
2. Canterbury:
The ARIMA model (plot subtitle indicates ARIMA(0,1,0)(0,1,1)[12] with
drift) projects Canterbury’s median price to continue its upward trend
from ~$450K-$500K in early 2020, reaching around $650K-$700K by
2030.
Uncertainty: Prediction intervals are wide and
expand.
3. National:
The ARIMA model (plot subtitle indicates ARIMA(0,1,2)(0,1,1)[12] with
drift) projects the national median price to increase from ~$600K-$700K
in early 2020 to around $850K-$900K by 2030, showing a continued but
modest upward trend.
Uncertainty: Prediction intervals are substantial and
widen.
4. Rest of North Island:
The ARIMA model (plot subtitle indicates ARIMA(0,1,0)(0,1,1)[12] with
drift) projects a strong and continuous upward trend for median prices,
rising from ~$500K-$600K in early 2020 to potentially over $1.5M by
2030.
Uncertainty: Prediction intervals are extremely
wide.
5. Rest of South Island:
The ARIMA model (plot subtitle indicates ARIMA(0,1,1)(0,1,1)[12] with
drift) projects median prices to continue increasing from ~$450K in
early 2020 to around $700K-$750K by 2030.
Uncertainty: Prediction intervals are wide.
6. Wellington:
The ARIMA model (plot subtitle indicates ARIMA(1,1,0)(0,1,1)[12] with
drift) projects a significant and continued upward trend for
Wellington’s median price, from ~$700K-$800K in early 2020 to
potentially over $1.6M by 2030.
Uncertainty: Prediction intervals are very wide.
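The effect of the lambda = 0 choice behind these helper median-price forecasts can be illustrated with base R alone. This is a minimal sketch on a hypothetical toy price series; the report itself uses forecast::auto.arima(..., lambda = 0), which handles the transformation internally:

```r
# Fitting on the log scale and back-transforming keeps forecasts strictly
# positive and turns additive drift into multiplicative growth, which is
# why it tends to give more sensible long-term trends for price data.
set.seed(1)
prices <- ts(500000 * 1.005^(1:60) * exp(rnorm(60, 0, 0.01)),
             start = c(2015, 1), frequency = 12)
fit <- arima(log(prices), order = c(0, 1, 0))  # random walk in logs
fc  <- predict(fit, n.ahead = 12)
point_forecasts <- exp(fc$pred)  # back-transformed to dollar levels
```

Note that exp() of the mean log-forecast is the forecast median, not the mean; the forecast package's biasadj argument exists to correct for this if mean forecasts are needed.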
1. Auckland:
Rent CPI RF Forecast: The forecast (orange line) shows
an initial slight dip/flattening from early 2020 levels (~1500-1550
CPI), then maintains a very gentle upward trend with minor fluctuations,
remaining largely stable around 1500-1550 through to 2030. This suggests
a significant slowdown compared to its strong historical trend and
contrasts with the more aggressive direct ARIMA forecast for Rent CPI
(Method 1).
Insight (Auckland): The RF model, influenced by a
decelerating median price forecast, projects a stabilization of Rent
CPI.
2. Canterbury:
Rent CPI RF Forecast: The forecast shows considerable
volatility but generally an upward trend, rising from ~1400-1450 in
early 2020 to around 1500-1550 by 2030. The forecast path is much less
smooth than for other regions. This suggests a continued, albeit
somewhat erratic, increase in Rent CPI.
Insight (Canterbury): The RF forecast suggests
continued, but volatile, Rent CPI growth, influenced by a moderately
increasing median price projection.
3. National:
Rent CPI RF Forecast: The forecast shows the national
Rent CPI remaining very stable, hovering around the 1500-1520 level
throughout the forecast period to 2030, with minor fluctuations. This
indicates a significant flattening compared to the strong historical
upward trend.
Insight (National): The RF model projects a
stabilization of the national Rent CPI, likely reflecting the more
moderate growth projected for national median prices.
4. Rest of North Island:
Rent CPI RF Forecast: The forecast shows an initial
dip, then a period of relative stability around the 1500-1550 CPI level,
with some fluctuations but no strong upward trend through to 2030. This
is a surprising outcome given the very strong upward projection for its
median price input.
Insight (Rest of North Island): Despite a very strong
upward median price forecast, the RF model projects a relatively flat
Rent CPI. This divergence suggests complex model interactions or
limitations.
5. Rest of South Island:
Rent CPI RF Forecast: The forecast shows an initial
period of volatility, then stabilizes around the 1450-1500 CPI level,
with some fluctuations but no clear strong upward trend through to 2030.
This indicates a stabilization after recent historical volatility.
Insight (Rest of South Island): The RF model projects a
stabilization of Rent CPI, influenced by a moderately increasing median
price forecast.
6. Wellington:
Rent CPI RF Forecast: The forecast shows Rent CPI
remaining relatively stable around the 1500-1550 mark, with some
fluctuations, after an initial slight dip from its early 2020 peak.
Similar to “Rest of North Island,” this relatively flat Rent CPI
forecast is counterintuitive given the strong upward projection for its
median price input.
Insight (Wellington): Despite a strong upward median
price forecast, the RF model projects a relatively stable Rent CPI,
suggesting the model is not solely driven by the price input in this
scenario.
General Trend - Stabilization/Modest Growth:
Unlike the direct ARIMA forecasts for Rent CPI (Method 1) which often
showed strong continued upward trends, the recursive Random Forest
forecasts (Method 2) generally project a period of stabilization or very
modest growth for Rent CPI across most regions (Auckland, National, Rest
of North Island, Rest of South Island, Wellington). Canterbury is
somewhat of an exception, showing more volatile but still upward-trending
growth.
Influence of Median Price Forecasts:
The Rent CPI forecasts from the RF model are inherently linked to the
“helper” ARIMA forecasts for median property prices. Regions where
median prices were forecasted to grow more slowly or stabilize (e.g.,
Auckland, National) also saw flatter RF Rent CPI forecasts. However, for
regions with very aggressive median price forecasts (e.g., Rest of North
Island, Wellington), the RF Rent CPI forecast did not always follow suit
with similar aggression, often showing stabilization instead. This might
indicate that the approximation of price_rent_cpi_ratio using lagged
Rent CPI dampens the model's responsiveness to price inputs.
Uncertainty in Inputs:
The significant uncertainty (wide prediction intervals) in the helper
ARIMA forecasts for median prices is a major underlying source of
uncertainty for these RF Rent CPI forecasts, even though explicit
prediction intervals are not generated by this recursive RF method.
Volatility in Forecasts:
Some regions, like Canterbury and Rest of South Island, show more
inherent volatility in their RF Rent CPI forecasts, likely reflecting
the historical volatility in their data that the RF model attempts to
capture or is influenced by.
Now that I have generated Rent CPI forecasts using two different methods (direct ARIMA for Rent CPI, and a recursive Random Forest model using ARIMA-forecasted median prices), I think it is very useful to plot these different forecasts together on the same chart, along with the actual historical data. This will allow me to directly compare their trajectories, levels, and how they perform relative to each other for the selected region(s).
If I run the forecasting loop for multiple regions, this combined plot will be faceted, making it easy to see these comparisons across different parts of New Zealand.
# This chunk assumes 'all_regional_arima_cpi_forecasts' and 'all_regional_rf_cpi_forecasts'
# lists have been populated by the preceding regional_forecasting_loop_chunk.
# It also uses 'full_data' for historical actuals and 'unique_regions_to_forecast' for filtering.
print("--- Generating Combined Forecast Comparison Plot ---")
## [1] "--- Generating Combined Forecast Comparison Plot ---"
# This plot will only be generated if forecasts were successfully created for at least one region by at least one method.
# The condition also checks if 'unique_regions_to_forecast' is defined.
if (exists("unique_regions_to_forecast") && length(unique_regions_to_forecast) > 0 &&
((exists("all_regional_arima_cpi_forecasts") && length(all_regional_arima_cpi_forecasts) > 0) ||
(exists("all_regional_rf_cpi_forecasts") && length(all_regional_rf_cpi_forecasts) > 0)) ) {
# Binding all the ARIMA forecasts together (if any were successful)
final_arima_forecasts_all_regions <- bind_rows(all_regional_arima_cpi_forecasts)
# Binding all the Random Forest forecasts together (if any were successful)
final_rf_forecasts_all_regions <- bind_rows(all_regional_rf_cpi_forecasts)
# Proceed only if there is something to plot from at least one forecasting method.
if(nrow(final_arima_forecasts_all_regions) > 0 || nrow(final_rf_forecasts_all_regions) > 0) {
# Preparing historical data for all regions that we attempted to forecast.
historical_data_all_regions_plot <- full_data %>%
filter(region %in% unique_regions_to_forecast) %>%
select(date, region, rent_cpi) %>%
rename(value = rent_cpi) %>%
mutate(method = "Actual", type = "Actual") # Adding columns for easy binding and plotting
# Preparing ARIMA forecast data for plotting, if it exists.
arima_plot_df <- if(nrow(final_arima_forecasts_all_regions) > 0) {
final_arima_forecasts_all_regions %>%
# Selecting only necessary columns and renaming for consistency before bind_rows.
# Also including prediction interval columns if I want to plot them.
select(date, region,
value = rent_cpi_forecast_arima,
lower95 = lower95_arima, # Taking 95% PI
upper95 = upper95_arima) %>%
mutate(method = "ARIMA Direct", type = "Forecast (ARIMA)")
} else {
# Return an empty tibble with the correct structure if no ARIMA forecasts.
tibble(date=as.Date(character(0)), region=character(0), value=numeric(0),
lower95=numeric(0), upper95=numeric(0), method=character(0), type=character(0))
}
# Preparing Random Forest forecast data for plotting, if it exists.
rf_plot_df <- if(nrow(final_rf_forecasts_all_regions) > 0) {
final_rf_forecasts_all_regions %>%
rename(value = rent_cpi_forecast) %>%
mutate(method = "RF Recursive", type = "Forecast (RF Recursive)") %>%
# Adding empty PI columns for consistent structure with ARIMA df, though RF doesn't provide them directly here.
mutate(lower95 = NA_real_, upper95 = NA_real_) %>%
select(date, region, value, type, method, lower95, upper95)
} else {
tibble(date=as.Date(character(0)), region=character(0), value=numeric(0),
lower95=numeric(0), upper95=numeric(0), method=character(0), type=character(0))
}
# Combining all data: actuals, ARIMA forecasts, and RF forecasts.
comparison_plot_data_all <- bind_rows(
historical_data_all_regions_plot,
arima_plot_df,
rf_plot_df
) %>%
filter(!is.na(value)) # Ensuring no NA values interfere with plotting lines.
if(nrow(comparison_plot_data_all) > 0){
# Ensuring 'type' is a factor for consistent plotting order and legend.
type_levels <- c("Actual") # Actual is always first.
if(nrow(arima_plot_df) > 0 && !"Forecast (ARIMA)" %in% type_levels) type_levels <- c(type_levels, "Forecast (ARIMA)")
if(nrow(rf_plot_df) > 0 && !"Forecast (RF Recursive)" %in% type_levels) type_levels <- c(type_levels, "Forecast (RF Recursive)")
comparison_plot_data_all$type <- factor(comparison_plot_data_all$type, levels = type_levels, ordered = TRUE)
# Creating the comparison plot.
all_regions_forecast_plot <- ggplot(comparison_plot_data_all, aes(x = date, y = value, colour = type, linetype = type)) +
geom_line(linewidth = 0.8) + # Line for actuals and mean forecasts
# Adding prediction intervals ONLY for ARIMA if they exist
geom_ribbon(data = filter(comparison_plot_data_all, type == "Forecast (ARIMA)" & !is.na(lower95) & !is.na(upper95)),
aes(ymin=lower95, ymax=upper95, x=date),
fill="deepskyblue3", alpha=0.15, inherit.aes=FALSE, show.legend = FALSE) +
# Faceting by region. Adjust ncol based on how many regions I am forecasting.
facet_wrap(~region, scales = "free_y", ncol = min(3, length(unique(comparison_plot_data_all$region)))) +
scale_colour_manual(values = c("Actual" = "black", "Forecast (ARIMA)" = "deepskyblue3", "Forecast (RF Recursive)" = "orangered2"),
name = "Data/Method:", drop = FALSE) + # drop=FALSE keeps all levels in legend even if one method failed for a region.
scale_linetype_manual(values = c("Actual" = "solid", "Forecast (ARIMA)" = "dashed", "Forecast (RF Recursive)" = "dotdash"),
name = "Data/Method:", drop = FALSE) +
scale_y_continuous(labels = comma) +
labs(title = "Rent CPI Forecast Comparison Across Selected Regions",
subtitle = "ARIMA Direct vs. Recursive Random Forest (RF uses ARIMA-forecasted Median Price input)",
x = "Date", y = "Rent CPI") +
theme(legend.position = "top", legend.key.width = unit(1.5, "cm"),
strip.text = element_text(size=rel(0.8))) # Adjusting facet label size.
print(all_regions_forecast_plot)
} else {
print("Not enough data to generate combined forecast plot after preparing forecast dataframes.")
}
} else {
print("Could not generate combined forecast plot; no forecast data from any method for any region.")
}
} else {
print("Combined plot for multiple regions skipped as conditions not met (e.g., only one region processed by a single method, or forecast data missing).")
}
Plot 7.3: Comparison of Rent CPI Forecasts from Different Methods.
This analysis compares the Rent CPI forecasts generated by two distinct methods:
Method 1: Direct ARIMA (dashed blue line with shaded 95% prediction intervals) - Univariate forecast based only on historical Rent CPI.
Method 2: Recursive Random Forest (RF) (dot-dash orange line) - Multivariate forecast using ARIMA-forecasted median prices and other predictors.
The plots show historical actual Rent CPI (solid black line) followed by the forecasts from both methods for each of the six broad New Zealand regions, extending to the end of 2030.
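The comparison here is visual, but if actuals later become available for part of the forecast window, the two methods could also be scored numerically. A hedged sketch using the long-format `comparison_plot_data_all` built in the chunk above; `future_actuals` (columns `date`, `region`, `actual`) is an assumed input:

```r
library(dplyr)

# Hypothetical sketch: RMSE/MAE per region and method against held-out actuals.
score_methods <- function(df_long, future_actuals) {
  df_long %>%
    filter(method != "Actual") %>%
    inner_join(future_actuals, by = c("date", "region")) %>%
    group_by(region, method) %>%
    summarise(rmse = sqrt(mean((value - actual)^2)),
              mae  = mean(abs(value - actual)),
              .groups = "drop")
}
```

This would make statements like "the RF forecast is less pessimistic" testable once post-forecast data arrives.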
Auckland
ARIMA Direct Forecast: Projects a continued strong
upward trend for Rent CPI, rising from ~1550 to ~1900-2000 by 2030. The
95% prediction interval (light blue shading) becomes very wide,
indicating high uncertainty.
RF Recursive Forecast: Shows a significant departure,
projecting Rent CPI to stabilize or grow very modestly, remaining around
the 1500-1550 level throughout the forecast horizon.
Comparison: The two methods offer starkly different
outlooks for Auckland. The ARIMA model extrapolates the strong
historical trend, while the RF model, likely influenced by its more
subdued median price input forecast, suggests a flattening.
Insight (Auckland): Significant divergence between
methods; ARIMA projects continued growth, RF projects stabilization. The
RF forecast implies a structural break from past Rent CPI growth trends,
driven by its interpretation of future predictor paths.
Canterbury
ARIMA Direct Forecast: Projects a gentle upward trend
from ~1400-1450 to around 1600-1700 by 2030, with very wide prediction
intervals.
RF Recursive Forecast: Also shows an upward trend, but
with more pronounced volatility and a slightly higher trajectory,
reaching around 1500-1550 by 2030, but with notable month-to-month
fluctuations.
Comparison: Both models project an increase, but the RF
forecast is more volatile. The ARIMA point forecast falls within the
general path of the RF forecast initially but then appears slightly more
optimistic in the long run.
Insight (Canterbury): Both methods suggest Rent CPI
growth, but the RF model captures more short-term volatility. The
long-term ARIMA trend is slightly more bullish than the central path of
the RF’s fluctuations.
National
ARIMA Direct Forecast: Extrapolates the strong, steady
historical upward trend, projecting Rent CPI to rise from ~1520 towards
2000 by 2030. Prediction intervals widen significantly.
RF Recursive Forecast: Projects a clear stabilization
of the national Rent CPI, hovering around the 1500-1520 level with minor
fluctuations.
Comparison: Similar to Auckland, there’s a major
divergence. ARIMA continues the historical growth, while RF predicts a
flattening of the national trend.
Insight (National): The two methods provide contrasting
national outlooks. RF suggests the national Rent CPI growth seen
historically will not continue at the same pace, likely due to its more
conservative median price input forecast.
Rest of North Island
ARIMA Direct Forecast: Projects a very strong, almost
exponential continuation of the accelerating upward trend, with Rent
CPI forecasted to rise dramatically. Note that the y-axis scale for this
facet goes up to 3,000,000, which appears to be a plotting error; the
actual forecast values would be much lower, likely aiming for >2500
based on previous plots for this method. The line itself shows a very
steep increase.
RF Recursive Forecast: Shows an initial dip from the
~1600 level, then stabilizes around the 1500-1550 mark with some
fluctuations.
Comparison: Extreme divergence. The ARIMA forecast is
exceptionally bullish (though the y-axis scaling in the plot is
problematic for interpretation of the absolute level, the trajectory is
clear). The RF forecast, despite a very strong median price input
forecast for this region (as seen in Method 2 analysis), still results
in a relatively flat Rent CPI projection.
Insight (Rest of North Island): The ARIMA forecast
appears unrealistic due to its extreme extrapolation (and potential
plotting scale issue). The RF forecast suggests stabilization,
indicating the model is not translating the aggressive median price
input into similarly aggressive Rent CPI growth.
Rest of South Island
ARIMA Direct Forecast: Projects a largely flat trend
around the 1300-1350 level, influenced by the recent sharp drop in
historical data. Prediction intervals are extremely wide, indicating
massive uncertainty.
RF Recursive Forecast: Shows initial volatility then
stabilizes around the 1450-1500 level, slightly higher than the ARIMA
forecast, with ongoing fluctuations.
Comparison: Both methods project a relatively flat
future, but the RF forecast stabilizes at a slightly higher level after
some initial volatility. The ARIMA forecast is heavily anchored by the
recent downturn.
Insight (Rest of South Island): Both models suggest an
end to the strong historical growth, but the RF model is less
pessimistic than the ARIMA model, which is heavily influenced by the
latest data points. High uncertainty remains a key feature for
ARIMA.
Wellington
ARIMA Direct Forecast: Shows a continued strong,
somewhat cyclical/seasonal upward trend, with Rent CPI projected to rise
significantly. Note that the y-axis for this facet goes up to 60,000,
another plotting scale issue; the actual forecast is likely aiming for
>2500, based on previous plots for this method. The trajectory is
steeply upward.
RF Recursive Forecast: After an initial dip from its
early 2020 peak (above 1600), the RF forecast stabilizes around the
1500-1550 level with fluctuations.
Comparison: Significant divergence. The ARIMA forecast
is very bullish (again, y-axis scaling issue). The RF forecast, despite
a strong median price input forecast for Wellington, projects
stabilization of Rent CPI.
Insight (Wellington): Similar to “Rest of North
Island,” the ARIMA extrapolation seems extreme. The RF model predicts
stabilization, not directly mirroring the aggressive median price input,
suggesting other factors or model learning are at play.
This project has been a comprehensive journey into analysing the New Zealand housing market, focusing on understanding and predicting Rent CPI. Starting with raw data, I went through a detailed process of data cleaning, standardisation, and integration, which was particularly challenging due to the different regional granularities and time frequencies of the source files. The Exploratory Data Analysis (EDA) phase, with its ~15 visualisations, helped me uncover various trends, distributions, and relationships within the housing data. Following that, I engineered several new features which I believed would enhance the predictive power of my models.
Key accomplishments included:
Data Integration: Building a single integrated dataset (full_data) suitable for analysis.
Preprocessing: Developing a recipe including imputation, dummy variable creation, normalisation, and handling of correlated predictors.
Initial Forecasting: Producing direct ARIMA and recursive Random Forest forecasts, the latter using ARIMA-forecasted median_price as a key input. These initial forecasts provided a baseline and highlighted the complexities of long-term prediction, especially regarding assumptions for exogenous variables.
While I am pleased with the progress, I also recognise some limitations in the current analysis:
Exogenous Input Forecasts: Future values of median_price were based on a univariate ARIMA forecast, and national dwelling ratios were held constant. More sophisticated forecasting for these key inputs would likely improve the Rent CPI forecast accuracy.
price_rent_cpi_ratio Approximation: In the recursive RF forecast, the price_rent_cpi_ratio feature (which originally used contemporaneous rent_cpi) was approximated using the previous month's rent_cpi. Retraining the RF model with a truly lagged version of this feature would be more robust.
This project has laid a strong foundation, and there are several exciting avenues I am keen to explore next to build upon this work:
Improved Exogenous Forecasts: Using more sophisticated models to forecast median_price for each region.
Lagged Ratio Retraining: Retraining the RF model with a truly lagged price-to-rent ratio (i.e., using lag(rent_cpi) in the price-to-rent ratio).
Additional Lag Features: Exploring further lagged versions of rent_cpi.
Deep Learning: Experimenting with neural network approaches via keras or torch in R.
Overall, this project has been an invaluable learning experience in end-to-end data analysis, from data integration and EDA through to advanced modelling and the initial stages of forecasting. I am excited about the potential to further refine these models and generate even more insightful predictions about the New Zealand housing market.
Thanks for taking the time to go through my property market analysis! I really enjoyed digging into this data, building the models, and exploring the forecasting side of things. It has been a fantastic learning journey.
If you would like to see more of my projects or connect professionally, you can find me here:
Cheers!